Preface

I don’t know in what context you might be looking at this HTML, but hello, I’m Tim. I’m writing this preface so that no matter where you’re coming from, I can explain what you’re about to see.

What is this?

This is a data analysis I completed on survey data I collected from a Facebook group known as (Sandy) Alex G 666posting. For those of you who don’t know, Alex G is a prolific indie artist who has been putting out music since at least 2010. His popularity has only continued to grow with each album release, but what makes him such an interesting musician is his sheer output: counting his leaked unreleased tracks, Alex has 200+ songs, which to me is impressive.

I joined the group back in January 2019 as I got into Alex’s unreleased discography, and I have come to really enjoy the community there. One day a member, Zion, asked people how old they were when they started listening to Alex G and how old they were now. I asked if it was okay, collected that data, and made a figure for the group showing the distribution of ages, along with a few others. I then asked whether I could make a full survey for the group to analyze more data; I got the okay, specifically from Isabelle, and began working. Many individuals helped with the survey, as you can see at the bottom, and the result was 38 columns of data from 211 individuals. Some questions are Alex G related, some are not, but I hope you find something interesting in this little project. I’ve been working on it on and off in my free time since late April. Either way, it’s a labor of love to give back to the great community I found online. The group is full of cool people and I’m glad to be a part of it (I also think this project helped me become a mod/admin, so that was nice).

Disclaimer

At the time of writing, 666posting is on the cusp of reaching 3,000 members. That’s great, but we need to talk about sample size, because it means we only have ~7.03% of the group accounted for. Not only that, but as this was an optional survey, one must be mindful of the kind of person who would take the time to fill it out. This is not an unbiased slice of the group, but I do like to think it includes the core group of active individuals. So with that in mind, let’s get into the analysis.
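That coverage figure is just simple arithmetic, which a couple lines of R confirm (211 respondents and 3,000 members are the numbers quoted above):

```r
# Respondents over total group membership, as a percentage
respondents <- 211
members <- 3000
round(respondents / members * 100, digits = 2)  # ~7.03
```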

Libraries used

  • tidyverse: Used for general exploratory analysis, primarily used dplyr within it
  • ztable: Used to make various tables
  • UpSetR: Used to make Upset plots
  • wordcloud: Used to make word clouds (surprise!)
  • rworldmap: The map of the world!
  • mapproj: The map of… US?
  • wesanderson: Wes Anderson Color Palettes
  • maps: More maps
  • viridis: Used for colorblind-friendly palette
  • rstatix: Stat tests
  • EnvStats: Used to get the N on ggplots for groupings
  • ggpubr: Arranging the plots
  • spotifyr: Getting that spotify data!
  • scales: Percent axis
  • reshape2: Data melting
  • ggforce: Sina Plots

Seed

set.seed(123) # just to keep the word clouds consistent between runs

## Loading the data and prepping functions


#functions
#hi_low_fact: unique values of x ordered by how often they appear
#(least common first by default), handy for setting factor levels
hi_low_fact <- function(x, decrease = F){
  names(table(x)[order(table(x), decreasing = decrease)])
}

#auto_ztable: cross-tabulate two columns, print a heatmap-styled ztable,
#and append the chi-squared test p-value to the caption
auto_ztable <- function(df, col1, col2, title, simul_p = F, pval_digits = 3, marg = 0) {
  temp_tab <- table(df[[col1]], df[[col2]])
  temp_ztab <- ztable(temp_tab)
  chi_result <- chisq.test(df[[col1]], df[[col2]], simulate.p.value = simul_p)
  temp_ztab %>% makeHeatmap(margin = marg) %>%
    print(caption = paste0(title," ; Chi-Square p-value of ", format.pval(chi_result$p.value, digits = pval_digits)))
}

#to_zcore: standard z-score (center on the mean, scale by the SD)
to_zcore <- function(x){
  (x - mean(x))/sd(x)
}
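To show what these helpers actually return, here’s a quick check on toy data (not survey data; the functions are repeated here so the snippet stands alone):

```r
# hi_low_fact: unique values ordered by frequency, least common first by default
hi_low_fact <- function(x, decrease = FALSE){
  names(table(x)[order(table(x), decreasing = decrease)])
}

# to_zcore: classic z-score standardization
to_zcore <- function(x){
  (x - mean(x))/sd(x)
}

hi_low_fact(c("a", "b", "b", "c", "c", "c"))                   # "a" "b" "c"
hi_low_fact(c("a", "b", "b", "c", "c", "c"), decrease = TRUE)  # "c" "b" "a"
to_zcore(c(1, 2, 3))                                           # -1  0  1
```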

Demographics Analysis

dat <- dat %>% mutate(
  race = str_to_title(race),
  orientation = str_to_title(orientation),
  gender = factor(gender, hi_low_fact(gender, decrease = F)),
  transgender = factor(transgender, levels = c("No", "Yes", "Prefer not to answer")),
  polyamorous = ifelse(polyamorous == "Prefer not to say", "Prefer not to answer", polyamorous),
  polyamorous = factor(polyamorous, levels = c("No", "Yes", "Prefer not to answer")),
  generation = ifelse(generation == "", "Prefer not to say", generation)
)

So the first demographics we’ll take a quick peek at are age and race. First, we’ll look at age in a histogram colored by “what generation do you identify with”. Get used to the age variable: it is one of our few numerical values, so I’m going to be plotting it a lot. Anyways, overall age is right-skewed, with only one real outlier at 42. Alex G is about 26-27 at the time of writing and has been writing music since his early teenage years, so a demographic of teens and 20-somethings is unsurprising.

As a straight white dude (who, as we will see, is the average Alex G listener), I can’t really speak to why Alex is or isn’t popular with other demographics, and frankly I’m not sure it is my place to speculate. As a result, for race, gender, trans identity, and polyamory I will let you draw your own conclusions. I just don’t feel that I, as an individual, should be speaking didactically about such a subject.

#dat <- read.csv("inputs/alex_g_stats_9_24.csv", stringsAsFactors = F)

#Age
age_plot <- ggplot(dat, aes(age, fill = generation)) +
  geom_histogram(breaks=seq(15, 45, by=1), color = "black")+
  ylab("Count") +
  xlab("Age") +
  theme_classic() +
  scale_x_continuous(breaks = round(seq(min(dat$age, na.rm = T), max(dat$age, na.rm = T), by = 5)/5) * 5)

age_summary <- round(t(as.matrix(summary(dat$age))), digits = 2)

rownames(age_summary) <- "Age"

age_summ_table <- ggtexttable(age_summary)


#Race plot
race_dat <- dat %>% count(race)


race_plot <- ggplot(race_dat, aes(race, n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_classic() +
  geom_text(aes(label=n), nudge_y = 2.5) +
  ylab("Count") +
  xlab("Race")



  
  
dem_plot <- ggarrange(age_plot, age_summ_table, race_plot, nrow = 3,
                      heights = c(1, .3, 1))

annotate_figure(dem_plot, top = text_grob(
  "Age and Race of individuals in 666posting"
))

#gender plot
gender_plot <- ggplot(dat, aes(gender, fill = gender)) +
  geom_bar() +
  coord_flip() +
  theme_classic() +
  theme(legend.position="bottom") +
  geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)
  
#gender_plot

#trans identity plot
trans_dat <- filter(dat, transgender != "") %>% select(age, transgender)
trans_plot <-  ggplot(trans_dat, aes(transgender)) + geom_bar() +
  coord_flip()+
  theme_classic() +
  geom_text(stat='count', aes(label=..count..), nudge_y = 4)

#trans_plot

#orientation plot
orientation_totals <- dat %>% group_by(orientation) %>%
  count(orientation)


orientation_plot <-  ggplot(dat, aes(orientation, fill = gender)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  theme_classic() +
  theme(legend.position="bottom")+
  geom_text(stat='count',
            aes(label=..count..),
            position = position_dodge(width = 1),
            hjust = -.08,
            size = 4)
    #geom_text(aes(orientation, n, label = n, fill = NULL), data = orientation_totals, nudge_y = 4)

#polyamorous identity plot
poly_totals <- dat %>% group_by(polyamorous) %>%
  count(polyamorous)

poly_plot <-  ggplot(dat, aes(polyamorous, fill = gender)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "none") +
  geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08)
 #   geom_text(aes(polyamorous, n, label = n, fill = NULL), data = poly_totals, nudge_y = 4)

#poly_plot


gpt_plot <- ggarrange(gender_plot, orientation_plot,
                      poly_plot, trans_plot,
                      nrow = 4, common.legend = T,
                      heights = c(.4, 1.5, 1, .8))


annotate_figure(gpt_plot, top = text_grob(
  "Gender count, polyamory colored on gender, and trans identity"
))

The Original Plot

This is the plot that inspired this project. How? Quick storytime:

Zion, a member of 666posting, asked individuals “What age were you when you discovered Alex G and what age are you now?”. Everyone started answering, and I found it interesting. Eventually over 200 individuals had responded, and I came up with the idea of making two overlapping histograms: people’s ages when they learned about Alex G, and their current ages. I asked for permission and then began the first analysis of 666posting. As you’ll see, I didn’t stop there with that data, since I thought of other things to do with it, but it eventually led to me asking if I could run a survey, which led to this whole R Markdown. So really Zion inspired me to do this, so if you’re reading this, thanks dude!

## One of the data points I make sure to derive is the year people learned about Alex G, using the fact this survey was done in 2020 as a reference. This will be used later


##Two histograms

dat <- dat %>% mutate(difference = age - first_age, year_learned = 2020 - difference)

#melt_dat <- melt(dat %>% select(-difference, -year_learned))
melt_dat <- melt(dat %>% select(age, first_age))



#get mean
start_mean <- mean(dat$first_age)
start_median <- median(dat$first_age)

end_mean <- mean(dat$age)
end_median <- median(dat$age)

sum_dat <- NULL
sum_dat$mean <- c(end_mean, start_mean)
sum_dat$median <- c(end_median, start_median)
sum_dat$group <- factor(c("end", "start"), levels = c("start", "end"))
sum_dat <- as.data.frame(sum_dat)

# Interweaved histograms
ggplot(melt_dat, aes(x=value, fill=variable)) +
  geom_histogram(position="identity", alpha=0.5, color = "black" ,
                 #bins = (max(melt_dat$value) - min(melt_dat$value)
                  binwidth = 1)+
  
  scale_x_continuous(breaks=seq(min(melt_dat$value),max(melt_dat$value),2)) +   
  #geom_vline(data=sum_dat, aes(xintercept=mean, color=group), size = 1.5,
            # linetype="dashed")+
    #geom_vline(data=sum_dat, aes(xintercept=median, color=group), size = 1.5,
             #linetype="dashed") +
  geom_vline(xintercept = c(end_median, start_median), size = 1.8,
             linetype="dashed", color = c("royalblue3","red3")) +
  annotate(geom="text", vjust = 1.7, size = 3.5, x = c(35),
           y = c(13.1, 16), label = c("Current Age Median", "Starting Age Median")) +
   geom_segment(aes(x = 32, y = 15.5, xend = 38, yend = 15.5),
                linetype="dashed", size = 1.5, color = "red3")+
   geom_segment(aes(x = 32, y = 12.5, xend = 38, yend = 12.5),
                linetype="dashed", size = 1.5, color = "royalblue3")+
  labs(title = "Age of Finding Alex G and Current Age", subtitle = "Overlapping Histogram", y = "# of Individuals", x = "Age", legend = "") +
  scale_fill_discrete(name = "",labels=c("Current Age", "Starting Listening Age")) +
  theme_classic(base_size = 13) + theme(legend.position="top")

ggsave("start_age_current.png", width = 9, height = 7)

Another plot I made, using the aforementioned “year learned of Alex G” data I derived, is a diagram of his fan base’s growth over time. I did this by taking the year each person first learned of Alex G and overlaying that with his album releases. However, unlike my initial version of this plot, which was based on people just commenting on a Facebook post, this data is a little… odd.

##Second histogram
fact_order <- as.character(sort(unique(dat$year_learned)))


##Add album years
race <- 2010
winner <- 2011
Rules_Trick <- 2012
DSU <- 2014
beach_music <- 2015
rocket <- 2017
HoS <- 2019


dat_2 <- (dat %>% select(year_learned))

ggplot(dat_2, aes(x=year_learned)) +
  geom_histogram(color = "black",  fill = "grey", binwidth = 1)+
    scale_x_continuous(breaks=seq(2009,2020,2)) +
  
  geom_vline(xintercept = race, linetype="dotted", 
                color = "deeppink1", size=1.5) +
   geom_vline(xintercept = winner, 
                color = "yellow3", size=1.5, linetype ="dotted") + 
  geom_vline(xintercept = Rules_Trick, 
                color = "deepskyblue1", size= 1.5, linetype ="dotted") +
    geom_vline(xintercept = DSU, 
                color = "springgreen4", size= 1.5, linetype ="dotted") +
      geom_vline(xintercept = beach_music, 
                color = "royalblue4", size= 1.5, linetype ="dotted") +
        geom_vline(xintercept = rocket, 
                color = "red3", size= 1.5, linetype ="dotted")  +
          geom_vline(xintercept = HoS, 
                color = "darkorchid4", size= 1.5, linetype ="dotted") + 
  annotate(geom="text", angle = 90, vjust = 1.3, size = 5,  
           x=c(race, winner, Rules_Trick, DSU, beach_music, rocket, HoS), 
           y=c(10, 10, 10, 10, 10, 10, 10), 
           label=c("Race", "Winner", "Rules/Trick", "DSU", "Beach Music", "Rocket", "House of Sugar"),
              #color="red"
           color = c("deeppink", "yellow3", "deepskyblue1", "springgreen4", "royalblue4", "red3", "darkorchid4")
           ) +
          labs(title = "Histogram of when people first heard of Alex G", x = "Year First Heard", y = "# of Individuals") +
  theme_classic(base_size = 13)

ggsave("year_start_listening.png", width = 9, height = 7)

So below is the original figure.

And compared to that figure, our new one is definitely different. First of all, we have much more of a normal distribution, besides the person… who first heard Alex G in ’07. Frankly, I’m a little dubious and would be interested in hearing that story. The individual did not identify which state they live in, so I thought my worries might be assuaged by hearing they lived in PA, but alas, I have no idea. The other interesting aspect is that our first non-’07 fan comes in around 2011, after Race, an arguably seminal Alex G album. I would have expected a larger boost with Race, and the previous plot I made did have individuals listening around the time of Race; admittedly few, but more than none. It is important to note that Alex G was also making music with his band the Skin Cells during those early years, but I’m less familiar with that history.

Also, compared to the previous plot, we have a lot more 2018-20 fans, but I think that’s primarily because the original data was pulled in late March/early April 2020, while this survey has been rolling open the entire year, so that is not unexpected. As for the data itself, it is for the most part normal, which is more indicative of the sample we’re pulling from; I would presume the counts should just continue increasing with each album, as his fame has only grown. Maybe there is a critical limit of indie fans that he might’ve reached, who knows! I personally believe this is also due to the fact that people trickle into these Facebook groups: while more people might like Alex now, they might not be in the group (or willing to answer a silly survey). It is interesting that 2015 and the release of Beach Music mark his largest jump in popularity, as that is also the year he signed with Domino Records, who probably helped promote his work to a wider audience.

How long have we been listening to Alex G?

This long.

dat_3 <- dat %>% select(difference)

ggplot(dat_3, aes(x=difference)) +
  geom_histogram(color = "black",  fill = "grey", binwidth = 1) + 
  scale_x_continuous(breaks=seq(min(dat_3$difference),max(dat_3$difference),1)) +
 # geom_vline(xintercept = mean(dat_3$difference)) +
  labs(title = "How Many Years Have We Been Listening?", x ="Length of Time in Years", y = "# of Individuals") +
  theme_classic(base_size = 17)

ggsave("how_long_listening.png", width = 9, height = 7)

It’s the previous plot inverted, what can I say.

Cumulative Distribution Plot of Fan Growth

I just like cumulative distribution plots; it’s all the same data. If you don’t know how they work: the line counts up, year by year, toward the total, so the height at any year is the fraction of the current fanbase that had discovered Alex by then, and you can see his percentage gain each year up to his current 100% of the fanbase.

ggplot(dat, aes(x=year_learned)) + stat_ecdf() + scale_x_continuous(breaks=seq(2009,2020,2))+
  scale_y_continuous(labels = percent) +
  geom_vline(xintercept = race, linetype="dotted", 
                color = "deeppink1", size=1.5) +
   geom_vline(xintercept = winner, 
                color = "yellow3", size=1.5, linetype ="dotted") + 
  geom_vline(xintercept = Rules_Trick, 
                color = "deepskyblue1", size= 1.5, linetype ="dotted") +
    geom_vline(xintercept = DSU, 
                color = "springgreen4", size= 1.5, linetype ="dotted") +
      geom_vline(xintercept = beach_music, 
                color = "royalblue4", size= 1.5, linetype ="dotted") +
        geom_vline(xintercept = rocket, 
                color = "red3", size= 1.5, linetype ="dotted")  +
          geom_vline(xintercept = HoS, 
                color = "darkorchid4", size= 1.5, linetype ="dotted") + 
  annotate(geom="text", angle = 90, vjust = 1.3, size = 5,  
           x=c(race, winner, Rules_Trick, DSU, beach_music, rocket, HoS), 
           y=c(.10, .20, .30, .50, .80, .60, .75), 
           label=c("Race", "Winner", "Rules/Trick", "DSU", "Beach Music", "Rocket", "House of Sugar"),
              #color="red"
           color = c("deeppink", "yellow3", "deepskyblue1", "springgreen4", "royalblue4", "red3", "darkorchid4")
           ) +
          labs(title = "How has Alex G's Fanbase Grown?",subtitle = "Cumulative Distribution Plot of Alex G Fanbase Growth", x ="Year", y = "Percent of Current Fanbase") +
  theme_classic(base_size = 13)

ggsave("alex_g_fanbase_growth.png", width = 9, height = 7)

Alex G Fans Across the World

So Alex G, for those not in the know, lives in Philly. But, like most successful artists, people want to see him perform in other cities and he tours pretty regularly. So I thought it would be interesting to see where Alex G fans are all over the world!

country_codes <- read.csv("inputs/countrycode.csv", stringsAsFactors = F)

#Get a count of fans per each country
world_dat <- as.data.frame(table(dat$country), stringsAsFactors = F)
 
colnames(world_dat) <- c("Country.Name", "fans")
#Join with official data
world_dat  <- left_join(country_codes, world_dat)
 
world_needed <- world_dat %>% select(fans, ISO2, ISO3, Country.Name) 

#colors
pal <- wes_palette("Zissou1", 100, type = "continuous")
 
 
#Get a higher resolution
worldmap <- getMap(resolution = "low")

worldmap@data <- left_join(worldmap@data, world_needed)
worldmap_poly <- fortify(worldmap)
worldmap_poly <- merge(worldmap_poly, worldmap@data, by.x="id", by.y="ADMIN", all.x=T)

#Make fans per 100,000; countries with no respondents count as 0 fans
worldmap_poly <- worldmap_poly %>% mutate(fans = ifelse(is.na(fans), 0, fans),
                                        fans_per = round((fans/POP_EST * 100000), digits = 3))
#Remove the US because they have too many individuals
worldmap_poly_nUSA <- worldmap_poly %>% filter(Country.Name != "United States")

Plotting the world

A simple plot of the whole world colored on fans per 100,000 individuals in each of those countries.

#Whole world with US
ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly, aes(long, lat, group = group, fill = fans_per),size = 0.3) +
  geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=1.5) + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + 
  theme_void() +
  labs(fill = "Fans per
100,000 indv")

HOLD UP, WHERE IS ALL THE RED? You may say, to which I respond, “GIVE ME A SECOND I’LL GET TO IT WE GOTTA ZOOM IN EVERYWHERE FIRST TO FIND IT”

North American Alex G Fans

ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly, aes(long, lat, group = group, fill = fans_per),size = 0.3) +
  geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5) +
 coord_cartesian(xlim = c(-150, -50), ylim = c(15, 60)) +
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered"))  + theme_void( base_size = 14) +
  labs(title = "Alex G Fans in North America", 
  fill = "Fans per
100,000 indv")

ggsave("outputs/north_america.png")

South American Alex G Fans

###South america
ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black" ,size = 0.3) +

geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5)+
  coord_cartesian(xlim = c(-90, -30), ylim = c(-60, 15))  +
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
  labs(title = "Alex G Fans in South America", 
  fill = "Fans per
100,000 indv")

ggsave("outputs/south_america.png", height = 8, width = 7)

South East Asia and Oceania Alex G Fans

###Oceana and south east asia
ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +

geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5)+
  coord_cartesian(xlim = c(100, 180), ylim = c(-50, 45))  +
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered"))  + theme_void( base_size = 14) +
  labs(title = "Alex G Fans in South East Asia and Oceania", 
  fill = "Fans per
100,000 indv")

ggsave("outputs/south_asia_aus.png", height = 8, width = 7)

Europe Alex G Fans

###Europe
ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=(fans)), size=5)+
  coord_cartesian(xlim = c(-10, 30), ylim = c(30, 70))   + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
  labs(title = "Alex G Fans in Europe", 
  fill = "Fans per
100,000 indv") +
  geom_text(aes(x = 18.3754, y =  34.3), label = "Malta is where all the
red on this map is")

ggsave("outputs/europe_current.png", height = 8, width = 7) 

Malta!

###Europe
ggplot() + 
  coord_map(xlim = c(-180, 180), ylim = c(-60, 75))  +
  geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=(fans)), size=5)+
  coord_cartesian(xlim = c(12, 16), ylim = c(35, 37))   + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
  labs(title = "The Alex G Malta Fan", 
  fill = "Fans per
100,000 indv") +
  geom_text(aes(x = c(15.10, 14), y =  c(35.8, 37)), label = c("Behold, Malta,
population of ~500,000", "This is Italy, in case you're garbage with geography like me"))

ggsave("outputs/europe_current.png", height = 8, width = 7) 

So with all that said, the only interesting thing I could think of to compare this data to is which countries Alex G most frequently tours in, so… tada, I took this from setlist.fm. They actually have interesting concert statistics, so I suggest checking them out. I had a project idea where I was going to scrape their data, make an average Alex G setlist for each year, and build a naive Bayes model/Shiny app that would let people put in their perfect setlist and return the probability of that concert happening. But they already did the first thing, and I ended up doing this entire analysis instead of the naive Bayes thing; that’s life. Right, TADA:

Alex G Concerts Around the World from setlist.fm

So I put that data into the following CSV file, which I load in below. I want to see whether the number of fans correlates with how many concerts Alex has played in that country. But you can see how it looked when I first plotted the data….

concerts_world <- read.csv("inputs/countries_concerts.csv", fileEncoding = "UTF-8-BOM")

joined_country_concerts <- left_join(concerts_world, world_dat)

ggplot(joined_country_concerts, aes(y = concerts, x = fans, label = Country.Name)) +
  geom_point() +
  labs(title = "Thanks America, guess we'll log scale") +
  theme_classic()

America ruined the graph. So I decided to log scale it and make it more informative with shapes and colors.

#We keepin Ireland
#joined_country_concerts <- joined_country_concerts %>% filter(!is.na(joined_country_concerts$fans))

shape_levels <- rev(1:nlevels(as.factor(joined_country_concerts$Country.Name)))




ggplot(joined_country_concerts, aes(y = concerts,
                                    x = fans,
                                    shape = Country.Name,
                                    color = Country.Name)) +
  geom_point(size = 3) +
  scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300), trans="log1p") +
  scale_y_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300), trans="log1p") +
             scale_shape_manual(values=shape_levels) +
  labs(title = "Alex G Concerts compared to number of fans, log(x + 1) scale") +
  theme_classic()

Ireland had 2 concerts and no fans which is why you can’t see it :(
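The scatter plot only eyeballs the fans-vs-concerts relationship; a rank correlation would put a number on it while staying robust to the US outlier. A minimal sketch on made-up counts (the toy_countries data frame is hypothetical; you would swap in the real joined_country_concerts for the actual test):

```r
# Hypothetical per-country fan/concert counts standing in for joined_country_concerts
toy_countries <- data.frame(
  fans     = c(150, 20, 12, 8, 5, 0),
  concerts = c(300, 25, 10, 6, 4, 2)
)

# Spearman works on ranks, so one huge country can't dominate the estimate
cor.test(toy_countries$fans, toy_countries$concerts, method = "spearman")
```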

Alex G fans in the United States

So now let us look at the Alex G Fans around the US.

I used the below vignette to guide this section: https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html

us_states <- map_data("state")
#head(us_states)

#Filter state data
state_dat <- dat %>% filter(!(state %in% c("I don't live in the states", ""))) %>% mutate(region = tolower(state))

state_df <- data.frame(table(state_dat$region),
                       stringsAsFactors = F) %>%
              select(region = Var1, count = Freq)

us_states_2 <- left_join(us_states, state_df)

I got the census data from census.gov, particularly the file named NST-EST2019-01: “Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. I’m not gonna lie, I was lazy and cleaned up the data in Excel, but basically I selected only the 2019 data, as it was most relevant. I used this data primarily to control for population in the coloration of the maps below. I don’t really have much to say about them, so enjoy!

#######
census <- read.csv("inputs/state_census.csv", stringsAsFactors = F) %>%
          mutate(region = tolower(str_replace(region, pattern = "[.]", replacement = ""))) %>% select(region, X2019)
          
census <- census %>% mutate(X2019= round(as.numeric(str_replace_all(X2019, ",", ""))))

#100000

us_states_3 <- left_join(us_states_2, census) %>% mutate(count = ifelse(is.na(count), 0, count),
                                                         adjust = round((count/X2019 * 100000), digits = 3))
## This data I'm gonna use to center text in the states, it's basically
## each state’s central longitude and latitude.

center_state <- read.csv(file = "inputs/statelatlong_2.csv", stringsAsFactors = F) %>% rename(region = name, lat_center = latitude, long_center = longitude) %>%
  mutate(region = tolower(region))

us_states_3 <- left_join(us_states_3, center_state)
p <- ggplot(data = us_states_3,
            mapping = aes(x = long, y = lat,
                          group = group, fill = adjust), color = "black")

Plotting all of the US

#All of US
p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
    geom_text(aes(long_center, lat_center, label = count), size = 3.2)  + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
  theme_void() +
  labs(title = "          Alex G Fans in the USA",fill = "Fans per
100,000 indv")

ggsave("outputs/all_us.png", height = 6, width = 8)

Plotting the North East

#North East
p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(xlim = c(-80, -65), ylim = c(38,48)) +
    geom_text(aes(long_center, lat_center, label = count), size = 3.3)  + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
  theme_void() +
  labs(title = "         Alex G Fans in the North East",fill = "Fans per 
100,000 indv")

ggsave("outputs/NE_us.png", height = 8, width = 8)

Plotting the South

#The south
p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(xlim = c(-110, -65), ylim = c(25,40)) +
    geom_text(aes(long_center, lat_center, label = count), size = 3.3)  + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
  theme_void() +
  labs(title = "    Alex G Fans in the South",fill = "Fans per 
100,000 indv")

ggsave("outputs/south_us.png", height = 8, width = 12)

Plotting the Midwest

#The Mid West
p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(xlim = c(-114, -80), ylim = c(32,49)) +
    geom_text(aes(long_center, lat_center, label = count))  + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
  theme_void() +
  labs(title = "      Alex G Fans in the Mid West",fill = "Fans per
100,000 indv")

ggsave("outputs/midwest_us.png", height = 8, width = 9)

Plotting the West

#The west
p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(xlim = c(-125, -110), ylim = c(30,49)) +
    geom_text(aes(long_center, lat_center, label = count))  + 
  scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
  theme_void() +
  labs(title = "      Alex G Fans in the West",fill = "Fans per
100,000 indv")

ggsave("outputs/west_us.png", height = 9, width = 7)

So the analysis I decided to do here asks “Where is Alex’s fan base the largest (when controlling for population size) in the United States?” I pulled this region data from Kaggle and will use it to test our hypothesis. “Our hypothesis?” you say? Yes: obviously it is the East Coast which has the most representation, and if it’s not, all hope will be lost.

(P.S. I’ve learned of the state.region data in R, but womp womp here we R, let’s just go with it)

state_region <- read.csv("inputs/states.csv", stringsAsFactors = FALSE)

us_state_region <- us_states_3 %>%
  select(State = region, count, adjusted_fans = adjust, State.Code = state, pop = X2019) %>%
  mutate(State = str_to_title(State)) %>% unique()

us_state_region <- left_join(us_state_region, state_region) %>%
  mutate(Region = case_when(
    State == "District Of Columbia" ~ "South",
    TRUE ~ Region
  ), Division = case_when(
    State == "District Of Columbia" ~ "South Atlantic",
    TRUE ~ Division
  ))
summarised_region <- us_state_region %>%
  group_by(Region) %>%
  summarise(adjusted_fans = sum(adjusted_fans), fans = sum(count), pop = sum(pop)) %>%
  mutate(proportion = fans/pop)



ggplot(summarised_region, aes(Region, adjusted_fans, fill = Region)) +
  geom_bar(stat = "identity") +
  labs(title = "Adjusted Fans Per Region",
       subtitle = paste("I mean... I think the east coast still wins when you add its sections together?",
                        "N = ", nrow(state_dat))) +
  ylab("Adjusted Fans") +
  theme_classic()

prop.test(x = summarised_region$fans, n = summarised_region$pop)
## 
##  4-sample test for equality of proportions without continuity
##  correction
## 
## data:  summarised_region$fans out of summarised_region$pop
## X-squared = 3.4198, df = 3, p-value = 0.3313
## alternative hypothesis: two.sided
## sample estimates:
##       prop 1       prop 2       prop 3       prop 4 
## 4.390522e-07 5.001536e-07 3.264839e-07 4.199483e-07

Well there doesn’t seem to be a clear winner… I’ll just tell myself if we included Southern East Coast states it would’ve won.

666posting Political Compass

Spice Alert

So one of the first graphs I generated from the survey, because I thought it would be a fun one to do, looks at the political leanings in the group. While politics has a lot more depth than can be represented by my figure, I think having two axes, left to right and libertarian to authoritarian, allows a fair amount of depth in itself. The question was phrased as follows:

“Where do you stand on the political compass? This website will calculate it, but you can do it by personal feel as well by looking at the image! https://www.politicalcompass.org/ (center at 5 and round if you get a score from the website, site ~10 minutes)”

The image below served as a guide. It isn't completely accurate if you ask me, but from a quick Google search, it does a good enough job.

The reason I said this is spicy is that, let's be honest, we live in an incredibly politically charged time. When I posted this, I completely understood people not being comfortable with individuals in the group having, uh, "strong fascist leanings," and it caused a bit of a row (never a good look when your post has more comments than likes). That said, I still think the data is pretty interesting, so let's look at it.

## Cleaning the dataset and getting medians and means for each leaning

#I remove one fill in the blank because... yeah
political_dat <- select(dat,left_right, lib_auth, gender, age) %>% drop_na() 
#%>% mutate(gender = ifelse(gender =="There are only 2 genders", "Prefer not to say", gender))


#Get the medians and means of the compass
med_LR <- median(political_dat$left_right)
med_LA <- median(political_dat$lib_auth)
mean_LR <- mean(political_dat$left_right)
mean_LA <- mean(political_dat$lib_auth)

Political plots

Because I asked for answers as integers, the data is technically a discrete ordinal value, so this is one of the better ways to represent it. The size of each circle represents the number of individuals who selected that option. There will be more readable breakdowns of this data later on in case you're interested in exact numbers.
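As an aside, ggplot2 can also do this counting for you: `geom_count()` tallies overlapping points and maps the tally to point size. A minimal sketch using the same `political_dat` (not the plot I actually built below, which needs the manual table for the background rectangles and annotations):

```r
#Minimal alternative sketch: geom_count() tallies duplicated (x, y) points
#and maps the tally to point size, skipping the manual table() bookkeeping
library(ggplot2)
ggplot(political_dat, aes(left_right, lib_auth)) +
  geom_count() +
  scale_size_area(max_size = 8)
```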

#Pad in a dummy respondent at (9, 8) so that cell exists in the count table;
#the extra count is subtracted right back out below
political_dat_2 <- rbind(political_dat,
                         data.frame(left_right = 9, lib_auth = 8,
                                    gender = NA, age = NA))

pol_count <- as.data.frame(table(political_dat_2$left_right, political_dat_2$lib_auth))
pol_count <- pol_count %>% mutate(Freq = ifelse((Var1 == "9" & Var2 == "8"), Freq - 1, Freq))

colnames(pol_count)[1:2] <- c("left_right", "lib_auth")
pol_count$left_right <- as.integer(as.character(pol_count$left_right))
pol_count$lib_auth <- as.integer(as.character(pol_count$lib_auth))

pol_count <- filter(pol_count, Freq != 0)

ggplot(pol_count) +
          #  
            geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.02) +
            geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.02) +
            geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.02) +
            geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.02) + 
           geom_point(aes(x = left_right, y = lib_auth, size= Freq)) +
           #Adding median and mean
           geom_point(aes(x=med_LR, y=med_LA), colour="white", shape = 9, size =3) +
           geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =3) +
         #Add the dotplot annotation
          geom_point(aes(x=8, y=3), colour="white", shape = 9, size =4) +
          geom_point(aes(x=8, y=1.5), colour="red", shape = 9, size =4) +
          labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat))) + 
   ylab("Libertarian to Authoritarian") +
               xlab("Left to Right") +
              # scale_x_continuous(breaks = c(0:10))+
               #scale_y_continuous(breaks = c(0:10)) +
              scale_x_continuous(breaks = c(0,2,4,6,8,10))+
               scale_y_continuous(breaks = c(0,2,4,6,8,10)) +
               scale_size_continuous(breaks = c(1, 3, 6, 9, 13, 15, 17),  range = c(2.5,8)) + 
                                annotate("text", x = c(8, 8),
                                       y = c(2, 3.5),
                                       label = c("Mean", "Median"), size =4) +
   annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
                                       y = c(10.5, -.5, 10.5, -.5),
                                       label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) + 
              coord_fixed() + theme_void(base_size = 14)

ggsave("outputs/political_compass_exact.png")

Here is a second political compass scatterplot, this time jittered, with point shape mapped to gender. In case you're wondering why some points fall outside the figure, that is due to the jitter; I apologize for that.

ggplot(political_dat) +
           geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.005) +
           geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.005) +
           geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.005) +
           geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.005) + 
          geom_jitter(aes(x = left_right, y = lib_auth, shape = gender), width = .5, height = .5, alpha = .5) +
          #Adding median and mean
          geom_point(aes(x=med_LR, y=med_LA), colour="blue", shape = 9, size =2) +
          geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =2) +
          #Add the dotplot annotation
          geom_point(aes(x=8, y=3), colour="blue", shape = 9, size =3) +
          geom_point(aes(x=8, y=1.5), colour="red", shape = 9, size =3) +
  #Add annotations to the compass
          labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat))) + 
   ylab("libertarian to authoritarian") +
               xlab("left to right") +
              coord_fixed() + annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
                                       y = c(10.5, -.5, 10.5, -.5),
                                       label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) +
                                annotate("text", x = c(8, 8),
                                       y = c(2, 3.5),
                                       label = c("Mean", "Median"), size =4) + 
                                       theme_void(base_size = 14)

ggsave("outputs/political_compass_gender.png")

The same scatterplot as above, but this time the dots are colored by the age of the individuals. I had to remove the individual over 40 because they threw the scale off even when I used a log scale, so my apologies to them.

#Remove the 40+ outlier because it skews the color scale
political_dat_age <- political_dat %>% filter(age < 40)

#Age
ggplot(political_dat_age) +
           
           geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.005) +
           geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.005) +
           geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.005) +
           geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.005) + 
          geom_jitter(aes(x = left_right, y = lib_auth, color = age), width =.5, height =.5 ) +
          #Adding median and mean
          geom_point(aes(x=med_LR, y=med_LA), colour="blue", shape = 9, size =2) +
          geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =2) +
          labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat_age))) + 
   ylab("libertarian to authoritarian") +
               xlab("left to right") +
              coord_fixed() + annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
                                       y = c(10.5, -.5, 10.5, -.5),
                                       label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) + 
                                       scale_colour_gradient(low = "black", high = "white", na.value = NA) +
                                       theme_void()

Let us see if we can break this data up into the different quadrants, with some wiggle room in the middle for centrists. Again, it's a scale of 0-10, so I'm gonna count 4-6 as centrist.

political_dat <- political_dat %>% mutate(quadrant = case_when(
  left_right <= 3 & lib_auth <= 3 ~ "Libertarian Left",
  left_right <= 3 & lib_auth >= 7 ~ "Authoritarian Left",
  left_right >= 7 & lib_auth <= 3 ~ "Libertarian Right",
  left_right >= 7 & lib_auth >= 7 ~ "Authoritarian Right",
  left_right >= 4 & left_right <= 6 & lib_auth >= 4 & lib_auth <= 6 ~ "Centrist",
  left_right >= 4 & left_right <= 6 & lib_auth <= 3 ~ "Libertarian Center",
  left_right >= 4 & left_right <= 6 & lib_auth >= 7 ~ "Authoritarian Center",
  lib_auth >= 4 & lib_auth <= 6 & left_right >= 7 ~ "Center Right",
  lib_auth >= 4 & lib_auth <= 6 & left_right <= 3 ~ "Center Left"
), quadrant = factor(quadrant, levels = c("Libertarian Left",
                                          "Authoritarian Left",
                                          "Center Left",
                                          "Libertarian Center",
                                          "Centrist",
                                          "Authoritarian Center",
                                          "Libertarian Right",
                                          "Center Right",
                                          "Authoritarian Right"
                                          )))


ggplot(political_dat, aes(quadrant, age, fill = quadrant)) +
   geom_violin() +
  geom_sina(alpha = .5) +
  scale_fill_brewer(palette = "Set1") +
  stat_n_text(size = 3)  +
  theme_classic() +
  theme(legend.position = "none") +
  labs(
    title = "Quadrants compared to Age",
    subtitle = paste0("N = ", nrow(political_dat))
  ) +
  xlab("Quadrant") +
  ylab("Age") + coord_flip()

While this plot is interesting, I don't think there is a good way to test these groups, because some of them have very few individuals. However, there is another way we can look at this.
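That said, if one did want a rough test of the quadrants anyway, a rank-based option that drops the sparse groups is sketched below. This is an assumption-laden sketch, not part of my analysis: the cutoff of 10 respondents is arbitrary.

```r
#Hedged sketch: keep only quadrants with at least 10 respondents, then run a
#Kruskal-Wallis rank-sum test of age across the remaining quadrants
library(dplyr)
big_quads <- political_dat %>%
  group_by(quadrant) %>%
  filter(n() >= 10) %>%
  ungroup() %>%
  mutate(quadrant = droplevels(quadrant))
kruskal.test(age ~ quadrant, data = big_quads)
```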

ggplot(political_dat, aes(x =  factor(left_right, 0:10), y = age, fill = factor(left_right, 0:10) )) +
  geom_violin(width = 2 ) +
  geom_sina(alpha = .5) +
  scale_fill_brewer(palette = "Set1") +
  stat_n_text(size = 3)  +
  theme_classic() +
  theme(legend.position = "none") +
  labs(
    title = "Left to Right compared to Age",
    subtitle = paste0("N = ", nrow(political_dat))
  ) +
  xlab("Left to Right (0-10)") +
  ylab("Age") + coord_flip()

broom::tidy(lm(age ~ left_right, data = political_dat))
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  23.2        0.414    56.1   6.70e-124
## 2 left_right    0.0375     0.139     0.270 7.88e-  1

Well, that's not significant. How about being libertarian or authoritarian?

ggplot(political_dat, aes(x =  factor(lib_auth, 0:10), y = age, fill = factor(lib_auth, 0:10) )) +
 geom_violin(width = 1) +
  geom_sina(alpha = .5) +
  scale_fill_brewer(palette = "Set1") +
  stat_n_text(size = 3)  +
  theme_classic() +
  theme(legend.position = "none") +
  labs(
    title = "Lib to Auth compared to Age",
    subtitle = paste0("N = ", nrow(political_dat))
  ) +
  xlab("Lib to Auth (0-10)") +
  ylab("Age") + coord_flip()

tidy(lm(age ~ lib_auth, data = political_dat))
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  23.5        0.467    50.3   4.33e-115
## 2 lib_auth     -0.0644     0.124    -0.519 6.04e-  1

Well, I guess I can say that, specifically in this 666posting cohort, your age doesn't seem to influence your political leanings. Sorry that wasn't more interesting, but again, some groupings have very few samples, which didn't help, and I didn't remove the outliers.
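One more hedged aside before moving on: since the compass answers are ordinal integers, a rank correlation arguably suits them better than the linear models above. I'd expect the same null result either way.

```r
#Sketch: Spearman rank correlations between age and each compass axis
#(ties in the integer scores will trigger R's approximate-p-value warning)
cor.test(political_dat$age, political_dat$left_right, method = "spearman")
cor.test(political_dat$age, political_dat$lib_auth, method = "spearman")
```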

Musicians in 666posting

Another question asked was whether you play any instruments and, if so, which ones.

Preparing the data and making a melted version of the dataframe

band_dat <- dat %>% mutate(id = row_number(), 
                           instruments = gsub("Music Software \\(eg. Ableton, FL studio\\)",
                                              "Music Software",
                                              instruments),
                           instruments = str_replace_all(instruments,
                                                         "a little bit of chinese flute",
                                                         "Chinese Flute")) %>% 
                     #Drop people who answered "Nope" or said they no longer play
                     filter(instruments != "Nope",
                            !str_detect(instruments, "anymore")) %>%
                    select(id, instruments, in_band) 

band_dat_melt <- band_dat %>%
  mutate(instrument = (str_split(instruments, ","))) %>%
  unnest(cols = instrument) %>%
  mutate(instrument =(str_trim(instrument, "both"))) %>%
  filter(!(instrument %in% c("Used to","but not anymore", "", "Nope")))

#Check data looks okay
unique(band_dat_melt$instrument)
##  [1] "Guitar"           "Banjo"            "Voice"            "Accordion"       
##  [5] "Bass"             "Violin/Viola"     "Percussion/Drums" "Piano/Keyboard"  
##  [9] "Brass"            "Music Software"   "Cello"            "Mandolin"        
## [13] "Sampler"          "Woodwind"         "Ukulele"          "omnichord"       
## [17] "Chinese Flute"    "clarinet"         "synthesiser"
#Make a dataframe giving each person a 1 or 0 for each instrument (1 = plays it, 0 = does not)
instr_dat <- (as.data.frame.matrix(table(band_dat_melt$id, band_dat_melt$instrument)))

Generate general summaries to be used in the first plot

#General stats
sum_instr <- as.data.frame(table(band_dat_melt$instrument),
                           stringsAsFactors = F) %>%
              mutate(Var1 = str_to_title(Var1))

Plot of instruments played in 666posting

ggplot(sum_instr, aes(x = reorder(Var1, -Freq), y = Freq, fill = Var1)) +
  geom_bar(stat= "identity") +
  labs(title = "Instruments played in this group",
       subtitle = paste0("N = ", length(unique(band_dat_melt$id)))) +
  scale_y_continuous(limits = c(0, 125)) +
  ylab(label = "Count") +
  xlab(label = "Instruments") + 
    theme_classic(base_size = 13) +
 theme(axis.text.x = element_text(angle=85, hjust=1, vjust = 1),
       legend.position = "None") +
  geom_text(aes(label=Freq),
            position=position_dodge(width=0.9),
            vjust=-0.25)

ggsave("outputs/instrument_summary.png", width = 8, height = 8)

The case of the Upset Plot vs the Venn Diagram

Okay, here I'm about to go on a visualization tirade. I'm sure you're used to Venn diagrams, but you've probably not heard of UpSet plots. Venn diagrams are the classic way to represent count overlap between multiple groups, but when there are many groups they generally reduce to circles with numbers inside them. That doesn't do the scale of the differences proper justice: people generally have an easier time grasping magnitudes through visualization, especially when comparing the heights of adjacent bars. To address this, a newer kind of plot was developed, the UpSet plot.

I will explain how to read them now. An UpSet plot consists of two bar plots connected by a central matrix of dots and lines. The horizontal bar plot in the bottom left gives the total count for each instrument: you can see that over 100 guitarists answered the survey, followed by ~60 keyboard/piano players. In the central matrix, each row is labeled by an instrument and each column represents a combination of instruments; the filled dots mark which instruments belong to that combination, and the vertical bar above each column counts the people who play exactly that combination. The more dots that are filled, the larger the intersection being described. So you can see that there are 25 solo guitarists, while 2 people play all of the instruments in this graph. I hope this makes sense; it will be readily apparent once you see the plots.

I couldn't fit all of the data in because of the plot's size limitations, so sorry to all of the unique-instrument players, but at least you got a shout-out in the first graph, you Chinese flute-playing god.
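In case the input format for these plots is unclear, UpSetR's `fromList()` takes a named list of member ids and builds the 0/1 membership matrix that the plot is drawn from. A toy sketch with made-up ids:

```r
#Toy sketch of the UpSetR input format: three invented "players" 1, 2, 3
library(UpSetR)
toy <- fromList(list(Guitar = c(1, 2, 3), Bass = c(2, 3), Voice = 3))
toy  #a data.frame of 0/1 membership indicators, one row per unique id
```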

#Make Instrument strings look like titles
colnames(instr_dat) <- str_to_title(colnames(instr_dat))

alex_musicians <- nrow(instr_dat %>%
                         filter(Banjo == 1, Bass == 1, Guitar == 1,
                                Mandolin == 1, `Music Software` == 1,
                                `Percussion/Drums` == 1, `Piano/Keyboard` == 1,
                                Voice == 1))

#Build a named list mapping each instrument to the ids of the people who play it
instrument_list <- list()
for (instru in 1:ncol(instr_dat)) {
   instrument_list[[colnames(instr_dat)[instru]]] <- (1:nrow(instr_dat))[instr_dat[, instru] == 1]
}

First Upset Plot

plot_up <- upset(fromList(instrument_list),
                 order.by = "freq",
                 nsets = 6,
                 #point.size = 4,
                 #line.size = 1.8,
                 text.scale=c(2.8, 1.6, 1.8, 1.6, 2, 2),
                 mainbar.y.label = "Instrument Intersections",
                 sets.x.label = "People Per Instrument"
                 )

plot_up

Second instrument plot; this one contains more instruments

plot_ups <- upset(fromList(instrument_list),
                  order.by = "freq",
                  nsets = 30,
                  #point.size = 2.8,
                  #line.size = 1,
                 text.scale=c(2.8, 1.7, 1.8, 1.6, 1.5, 2),
                  mainbar.y.label = "Instrument Intersections",
                  sets.x.label = "People Per Instrument"
                  )

plot_ups

Fun fact: there are 2 people who play the same instruments as Alex, including guitar, bass, drums, banjo, voice, mandolin, music software, and keyboard (to my knowledge).

#muted for final readout

# png("outputs/instrument_plot.png", width = 800, height = 800)
# plot_ups
# dev.off()
# 
# png("outputs/instrument_5_plot.png", width = 800, height = 800)
# plot_up
# dev.off()

How many musicians are in a band?

One last quick question about our instrumentalists, are you in a band?

band_yn <- band_dat_melt %>% select(id, in_band) %>% unique()


ggplot(band_yn, aes(in_band)) +
  geom_bar() + theme_classic() +
  ylab("Count of individuals") +
  xlab("Are you in a band?") +
  labs(title = "Are 666posting musicians in a band?",
      subtitle = paste("N = ", nrow(band_yn)))

Frankly, I expected “No” to be the dominant answer, but I am surprised by how good a fight “Yes” put up. Maybe that’s my personal bias of knowing more musicians who aren’t in bands.

Non-musical art

This is basically the same analysis as above, but this time we’re looking at the rest of the world of The Arts!

art_dat <- dat %>%  filter(art != "") %>%
                    select(Timestamp, art) 

art_dat_melt <- art_dat %>%
  mutate(art = (str_split(art, ","))) %>%
  unnest(cols = art) %>%
  mutate(art =(str_trim(art, "both"))) %>%
  filter(!(art %in% c("Used to","but not anymore", "", "Nope")))

#Check data looks okay
#unique(art_dat_melt$art)


art_dat <- (as.data.frame.matrix(table(art_dat_melt$Timestamp, art_dat_melt$art)))
#General stats
sum_art <- as.data.frame(table(art_dat_melt$art),
                           stringsAsFactors = F) %>%
              mutate(Var1 = str_to_title(Var1))

Plot of art mediums in 666posting

ggplot(sum_art, aes(x = reorder(Var1, -Freq), y = Freq, fill = Var1)) +
  geom_bar(stat= "identity") +
  labs(title = "Art mediums used in this group",
       subtitle = paste0("N = ", nrow(art_dat))) +
  scale_y_continuous(limits = c(0, 75)) +
  ylab(label = "Count") +
  xlab(label = "Art medium") + 
    theme_classic(base_size = 13) +
 theme(axis.text.x = element_text(angle=85, hjust=1, vjust = 1),
       legend.position = "None") +
  geom_text(aes(label=Freq),
            position=position_dodge(width=0.9),
            vjust=-0.25)

ggsave("outputs/art_summary.png", width = 8, height = 8)
#Prepare art upset plot

#Make art medium strings look like titles
colnames(art_dat) <- str_to_title(colnames(art_dat))

#Build a named list mapping each art medium to the ids of the people who use it
art_list <- list()
for (art_hold in 1:ncol(art_dat)) {
   art_list[[colnames(art_dat)[art_hold]]] <- (1:nrow(art_dat))[art_dat[, art_hold] == 1]
}

Art Upset Plot

art_upset <- upset(fromList(art_list),
                 order.by = "freq",
                 nsets = 6,
                 #point.size = 4,
                 #line.size = 1.8,
                 text.scale=c(2.8, 1.6, 1.8, 1.6, 2, 2),
                 mainbar.y.label = "Art Medium Intersections",
                 sets.x.label = "People Per Medium"
                 )

art_upset

Favorite Music/Musicians

Word Clouds

So… word clouds. Technically speaking, word clouds are never a good way to represent data meaningfully: although letter size scales with frequency, one can still be tricked into thinking that, in this case, long band names have more votes, which isn't true. As a result, you'll see that I ended up coloring the data a little to help counteract this visual discrepancy.

musician_dat <- dat %>%
        select(Timestamp, favorite_musician, sec_musician = second_favorite_musician, music_genre)  %>%
        #Mutate all names to lower case first and trim whitespace
        mutate(favorite_musician = str_trim(tolower(favorite_musician)),
               sec_musician = str_trim(tolower(sec_musician)),
        #Fix all spelling of musician names that stand out to me
               favorite_musician = case_when(
                 favorite_musician == "elliot smith" ~ "elliott smith",
                 str_detect(favorite_musician, "suicide") ~ "teen suicide",
                 str_detect(favorite_musician, "title") ~ "title fight",
                 T ~ favorite_musician),
               sec_musician = case_when(
                 sec_musician == "elliot smith" ~ "elliott smith",
                 str_detect(sec_musician, "suicide") ~ "teen suicide",
                 T ~ sec_musician
               ),
        #Clean up alex g titles to new/old official spelling and string to title for presentation
               favorite_musician = ifelse(favorite_musician == "(sandy) alex g", "Alex G", str_to_title(favorite_musician)),
               sec_musician = ifelse(sec_musician == "(sandy) alex g", "Alex G", str_to_title(sec_musician))) %>%
        #clean all blanks
        filter(favorite_musician != "", sec_musician != "", music_genre != "")

fave_musician <- as.data.frame(table(musician_dat$favorite_musician), stringsAsFactors = F)
sec_musician <- as.data.frame(table(musician_dat$sec_musician), stringsAsFactors = F)
color_artist <- brewer.pal(n = 3, name = "Dark2")

fave_musician <- fave_musician %>% mutate(color_fan = case_when(
                            Freq >= 29 ~ color_artist[1],
                            Freq > 1 ~ color_artist[2],
                            T ~ "black"
                          ))


sec_musician <- sec_musician %>% mutate(color_fan = case_when(
                            Freq >= 29 ~ color_artist[1],
                            Freq > 1 ~ color_artist[2],
                            T ~ "black"
                          ))

Word cloud For Favorite musician

#Word cloud for markdown

wordcloud(words = fave_musician$Var1, freq = fave_musician$Freq, min.freq = 1,
                            max.words=200, random.order=FALSE, rot.per=0.30, 
                            colors=fave_musician$color_fan, ordered.colors = T,
          scale=c(5,.666)
                            )

#Print word cloud

## Silenced printing for final release
# png("outputs/favorite_musician.png", units="in", width=5, height=5, res=300)
# wordcloud(words = fave_musician$Var1, freq = fave_musician$Freq, min.freq = 1,
#                             max.words=200, random.order=FALSE, rot.per=0.35, 
#                             colors=fave_musician$color_fan, ordered.colors = T
#                             )
# dev.off()

If an artist has more than one fan, their name is in orange.

Barplot of the top 10 musicians, excluding Alex G because he makes the plot look bad

top_fave <- fave_musician %>% arrange(desc(Freq)) %>% head(11)

top_fave_nog <- top_fave[-1,]

ggplot(top_fave_nog, aes(x = factor(Var1, rev(top_fave_nog$Var1)), y = Freq)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Top Favorite Musicians") +
  ylab("Count") +
  labs(title ="Top Ten Favorite Non-Alex G Artists") + theme_classic()

Alex has 80 fans by the way.

Word cloud For Second Favorite musician

#Word cloud for markdown

wordcloud(words = sec_musician$Var1, freq = sec_musician$Freq, min.freq = 1,
                            max.words=200, random.order=FALSE, rot.per=0.35, 
                            colors=sec_musician$color_fan, ordered.colors = T,
                            scale=c(5,0.5)
                            )

#Print word cloud
png("outputs/second_musician.png", units="in", width=5, height=5, res=300)
wordcloud(words = sec_musician$Var1, freq = sec_musician$Freq, min.freq = 1,
                            max.words=200, random.order=FALSE, rot.per=0.35, 
                            colors=sec_musician$color_fan, ordered.colors = T
                            )
dev.off()
## png 
##   2
sec_fave <- sec_musician %>% arrange(desc(Freq)) %>% head(11)

sec_fave_nog <- sec_fave[-1,]

ggplot(sec_fave_nog, aes(x = factor(Var1, rev(sec_fave_nog$Var1)), y = Freq)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Top Second Favorite Musicians") +
  ylab("Count") +
  labs(title ="Top Ten Second Favorite Non-Alex G Artists") + theme_classic()

Alex has 33 individuals who consider him their second favorite artist.

Favorite Genres

genre_dat <- musician_dat %>%
                  mutate(music_genre = str_replace(str_to_lower(str_trim(music_genre)), "-", " ")) %>%
                  filter(music_genre != "") %>%
                  mutate(music_genre = case_when(
                    music_genre %in% "imdie rock" ~ "indie rock",
                    music_genre %in% "lofi / 'indie rock' i guess lol" ~ "lofi",
                    music_genre %in% "indite rock" ~ "indie rock",
                    music_genre %in% "indie rock / folk / jazz" ~ "jazz",
                    music_genre %in% "indie rock, punk, alternative." ~ "punk",
                    music_genre %in% "shoegaze/indie rock" ~ "shoegaze",
                    T ~ music_genre
                    ),
                    music_genre = str_to_title(music_genre))

genre_table <- data.frame(table(genre_dat$music_genre), stringsAsFactors = F)


genre_table <- genre_table %>% mutate(color_fan = case_when(
                            Freq >= 30 ~ color_artist[1],
                            Freq > 1 ~ color_artist[2],
                            T ~ "black"
                          ))

Word cloud For Favorite Genre

#Word cloud for markdown

wordcloud(words = genre_table$Var1, freq = genre_table$Freq, min.freq = 1,
                            max.words=200, random.order=FALSE, rot.per=0.30, 
                            colors=genre_table$color_fan, ordered.colors = T, 
                             scale=c(4,1)
                            )

# #Print word cloud
# png("outputs/favorite_genre.png", units="in", width=5, height=5, res=300)
# wordcloud(words = genre_table$Var1, freq = genre_table$Freq, min.freq = 1,
#                             max.words=200, random.order=FALSE, rot.per=0.30, 
#                             colors=genre_table$color_fan, ordered.colors = T
#                             )
# dev.off()

Favorite Genre Barplot

genre_fave <- genre_table %>% arrange(desc(Freq)) %>% head(10)

ggplot(genre_fave, aes(x = factor(Var1, rev(genre_fave$Var1)), y = Freq)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Top Favorite Genres") +
  ylab("Count") +
  labs(title ="Top 10 Favorite Genres")  + theme_classic()

music_summary <- t(as.matrix(table(dat$music_method)))
music_summary <- t(as.matrix(music_summary[,-1]))
music_summ_table <- ggtexttable(music_summary)

How did you find Alex G?

First, we gotta do some MASSIVE data cleaning. I really should not have left this open for people to put in whatever they want; look at all the unique answers, of which there are 37 as of writing. So we gotta find a way to generalize them. That said, some are very interesting.

unique(dat$discover_method)
##  [1] "Through a friend"                                                               
##  [2] "Spotify/Apple Suggested Music"                                                  
##  [3] "I know him personally"                                                          
##  [4] "DIY show "                                                                      
##  [5] "Youtube"                                                                        
##  [6] "Press"                                                                          
##  [7] "Tumblr"                                                                         
##  [8] "College Radio"                                                                  
##  [9] "Reddit"                                                                         
## [10] "Compilation Albums"                                                             
## [11] "I read an article from Pitchfork about the release of DSU"                      
## [12] "Suggested by another artist I listen to"                                        
## [13] "Pitchfork"                                                                      
## [14] "SoundCloud"                                                                     
## [15] "Toured Together"                                                                
## [16] "Crush at the time"                                                              
## [17] "another FB group"                                                               
## [18] ""                                                                               
## [19] "4chan"                                                                          
## [20] "Soundcloud"                                                                     
## [21] "saw his name on a festival lineup"                                              
## [22] "Soundcloud recommendation"                                                      
## [23] "Flaked on Netflix"                                                              
## [24] "A girl in twitter who made playlists"                                           
## [25] "Record store clerk"                                                             
## [26] "through music blogs like NME or pitchfork"                                      
## [27] "Radio"                                                                          
## [28] "Nintendo 64 cover on a Ztapes compilation"                                      
## [29] "pitchfork's rocket review, sorry!"                                              
## [30] "8tracks.com"                                                                    
## [31] "Wikipedia"                                                                      
## [32] "Festival Lineup"                                                                
## [33] "Music journalists/festivals"                                                    
## [34] "Saw him perform at a festival"                                                  
## [35] "Music blog"                                                                     
## [36] "Facebook"                                                                       
## [37] "He performed at a festival I attended"                                          
## [38] "lofi record labels (specifically birdtapes and orchid tapes)"                   
## [39] "From a vine"                                                                    
## [40] "Through facebook friends"                                                       
## [41] "tumblr"                                                                         
## [42] "Through my brother (maybe this comes under friend?)"                            
## [43] "420 Love Songs Compilation (wasnt sure if you meant Alex G compilations sorry!)"
## [44] "Through the Spotify playlists of a YouTuber I like"                             
## [45] "i really can’t remember "                                                      
## [46] "radio podcast"                                                                  
## [47] "at a concert (Living Bread, brooklyn, may 2013)"                                
## [48] "was the support at a show i was at "                                            
## [49] "Flake"                                                                          
## [50] "GTA V lol"

Cleaning the data so it’s presentable

If you’re viewing the version without the code you might not know what is going on here, but basically I am using keywords in all of the non-standard answers to fit them into broader categories. For example, if the answer includes a website, I detect a unique substring in that character string (“umblr” for Tumblr) and then rename that answer to “Other Website”. This is a more interesting section to look at via the coded version of this file.
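As a tiny standalone illustration of the trick (a sketch separate from the survey code, assuming stringr is loaded): dropping the first letter of the keyword sidesteps capitalization differences in the free-text answers.

```r
library(stringr)

# "umblr" matches "Tumblr" and "tumblr" alike, no case conversion needed
str_detect(c("Tumblr", "found him on tumblr", "Spotify"), "umblr")
# [1]  TRUE  TRUE FALSE
```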

#I'm trying to avoid making all lower and then upper
websites <- c("umblr", "ound", ".com","iki", "vine", "acebook", "FB", "blog", "chan", "You", "eddit", "etflix")
online_personality <- c("witter", "artist ZI", "ouTuber")
friends <- c("other", "Crush", "clerk")
radio <- "adio"
journalism <- c("itchfork", "ress", "review")
live_music <- c("estival", "oncert", "oured", "show")
compilations_lab <- c("ompilation", "label")

dat_disc <- mutate(dat, discover_method = case_when(
  str_detect(discover_method, paste(websites, collapse = "|")) ~ "Other Website",
  str_detect(discover_method, paste(online_personality, collapse = "|")) ~ "Online Personality",
  str_detect(discover_method, paste(journalism, collapse = "|")) ~ "Journalism",
  str_detect(discover_method, paste(friends, collapse = "|")) ~ "Through a friend",
  str_detect(discover_method, paste(live_music, collapse = "|")) ~ "Live Music",
  str_detect(discover_method, paste(radio, collapse = "|")) ~ "Radio",
  str_detect(discover_method, paste(compilations_lab, collapse = "|")) ~ "Compilations/Label",
  T ~ discover_method
)) %>% filter(!discover_method %in% c("", "i really can’t remember ")) %>%
  select(Timestamp, discover_method)


unique(dat_disc$discover_method)
##  [1] "Through a friend"              "Spotify/Apple Suggested Music"
##  [3] "I know him personally"         "Live Music"                   
##  [5] "Other Website"                 "Journalism"                   
##  [7] "Radio"                         "Compilations/Label"           
##  [9] "Online Personality"            "Flake"                        
## [11] "GTA V lol"

Making the plot, now simplified, including a table of how we listen to our music

discover_g <- ggplot(dat_disc, aes(x = discover_method)) +
  geom_bar()  +
  coord_flip() +
  xlab("How did you discover Alex G") +
geom_text(stat='count', aes(label=..count..), nudge_y = 1.8) +
  labs(title = "How did we discover Alex G?",
       subtitle = paste("N =", nrow(dat_disc))) +
  theme_classic()

grob_mus <- ggarrange(discover_g, music_summ_table, nrow = 2, heights = c(1, .3))


annotate_figure(grob_mus, bottom = "How we listen to music")

Alex G Music

Now we’re getting into that good Alex G data, favorite albums, favorite songs, and more!

Favorite Album

##Filter down to albums, songs, and music videos to make an object usable in the next couple chunks
songs_dat <- dat %>%
  select(Timestamp,
         favorite_song,
         favorite_album,
         unreleased_album,
         music_vid) %>% 
  mutate(music_vid = str_to_title(music_vid),
         music_vid = ifelse(str_detect(music_vid, "Poison"), "Poison Root", music_vid))

Favorite Album Plot

A barplot of favorite released Alex G albums

album_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(favorite_album)),
      fill = forcats::fct_infreq(favorite_album)) +
      geom_bar(aes(fill = forcats::fct_infreq(favorite_album))) +
      scale_fill_manual(
      values = c("deepskyblue1",
                 "royalblue4",
                 "springgreen4",
                 "deeppink1",
                 "red3",
                 "darkorchid4",
                 "black",
                 "grey68",
                 "yellow3"
                )
            ) +
    theme_classic(base_size = 14)  + 
   theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.title = element_blank()) +
    labs(title = "Favorite Alex G Album",
         subtitle = paste0("N = ",nrow(songs_dat))) +
    geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)

album_plot

ggsave("outputs/favorite_album.png", height = 7, width = 7)

Favorite Unreleased Album/Compilation

A barplot of favorite Alex G fan compilations. Also, for those of you that don’t listen to his unreleased stuff, you should give it a listen.

fan_comp_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(unreleased_album))) +
  geom_bar(aes(fill = forcats::fct_infreq(unreleased_album))) +
  theme_classic(base_size = 14)  +
   theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.title = element_blank()) +
    labs(title = "Favorite Alex G Fan Compilation",
         subtitle = paste0("N = ",nrow(songs_dat))) +
    geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)

fan_comp_plot

#Save it
ggsave("outputs/favorite_fancomp.png", height = 7, width = 7)

Overall it seems that people don’t listen to them, but as I said, I suggest it. The winner, by a large margin, is Monsterhead, which makes sense to me. Out of all his compilations, I feel it is the most stacked with well-known unreleased tracks (Nintendo 64, Uh, Written in Blood, etc.)

Favorite Music Video

music_vid_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(music_vid))) +
  geom_bar(aes(fill = forcats::fct_infreq(music_vid))) +
  theme_classic(base_size = 14)  +
   theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.title = element_blank()) +
    labs(title = "Favorite Alex G Music Video",
         subtitle = paste0("N = ",nrow(songs_dat))) +
    geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)

music_vid_plot

#Save it
ggsave("outputs/favorite_music_vid.png", height = 7, width = 7)

I mean, Gretel’s video is really good.

##  Prepare song data for plotting

#Read in songs for albums, A CSV I made.
album_songs <- read.csv("inputs/alex_g_album_songs.csv", stringsAsFactors = F)
album_songs <- album_songs %>% mutate_all(str_to_title)


#Fixing various naming issues
songs_dat <- songs_dat %>% mutate(favorite_song = case_when(
                                    str_detect(favorite_song, "Bonus") ~ "Good",
                                    str_detect(favorite_song, "House") ~ "Sugarhouse",
                                    TRUE ~ favorite_song
                                  ),
                                  favorite_song = str_to_title(str_trim(favorite_song, side = "both")))

#Get count of songs
songs_only <- as.data.frame(table(songs_dat$favorite_song), stringsAsFactors = F)


#Color songs based on album source
songs_only <- songs_only %>% mutate(color = case_when(
                                    Var1 %in% album_songs$house_of_sugar ~ "darkorchid4",
                                    Var1 %in% album_songs$rocket_ ~ "red3",
                                    Var1 %in% album_songs$beach_music ~ "royalblue4",
                                    Var1 %in% album_songs$dsu ~ "springgreen4",
                                    Var1 %in% album_songs$trick ~ "deepskyblue1",
                                    Var1 %in% album_songs$rules ~ "black",
                                    Var1 %in% album_songs$easy ~ "lightseagreen",
                                    Var1 %in% album_songs$race ~ "deeppink1",
                                    Var1 %in% album_songs$winner ~ "yellow4",
                                    TRUE ~ "gray34"
                                  ))

#Clean up data
songs_clean <- songs_only %>%
  arrange(desc(Freq)) %>%
  filter(Var1 != "")

Word Cloud of Favorite Alex G Songs

A word cloud of everyone’s favorite Alex G songs from the survey, colored by the album they’re on, with size based on the number of people that selected a given song as their favorite. As you can see, the top two favorite songs of those who answered the survey are Snot and Gnaw, and tbh, I’m SNOT surprised. Ugh. Anyways, the thing I actually love about this image is that, because of how Alex names his songs, cool phrases get generated. Personally I love “Screwy People Change My Mind”; it’s just fun sticking his song titles together.

#Word cloud for markdown

wordcloud(words = songs_only$Var1, freq = songs_only$Freq, min.freq = 1,
                            max.words=200, random.order=FALSE, rot.per=0.35, 
                            colors=songs_only$color, ordered.colors = T)

# #Print word cloud
# png("outputs/favorite_song_cloud.png", units="in", width=5, height=5, res=300)
# wordcloud(words = songs_only$Var1, freq = songs_only$Freq, min.freq = 1,
#                             max.words=200, random.order=FALSE, rot.per=0.35, 
#                             colors=songs_only$color, ordered.colors = T)
# dev.off()

A bar plot of the top 10 favorite Alex G songs, just to ascribe hard numbers to the data, which is the issue with word clouds, as said before.

top_ten_songs <- head(songs_clean, 10)

ggplot(top_ten_songs, aes(x = reorder(Var1, -Freq), Freq)) +
  geom_col(fill = top_ten_songs$color) +
  labs(
    title = "Top 10 Favorite Alex G Songs",
    subtitle = paste("Total N =", sum(songs_clean$Freq), "; In graph N =", sum(top_ten_songs$Freq))
  ) + ylab("count") +
  theme_classic(base_size = 14)  +
  theme(axis.title.x=element_blank(),
                           axis.text.x = element_text(angle = 90, hjust = 0))

ggsave("outputs/top_10_songs.png", height = 7, width = 7)

Looks like Gnaw is our winner!

An analysis of favorite song vs favorite album

A quick question I wanted to ask is: “Is it more likely that your favorite song is on your favorite album?”, which is what I test below. I made sure to only include favorite songs that appear on specific albums, as with unreleased material stuff gets dicey. I ran a chi-square test, which has the null hypothesis “there is no association between one’s favorite song and their favorite album”, meaning that the count should be split evenly between “yes, my favorite song is on my favorite album” and “no, my favorite song is not on my favorite album”, and it turns out….

### Make chi-square analysis
songs_dat_test <- songs_dat %>% mutate(favorite_album_csv = case_when(
  favorite_album %in% "Rocket" ~ "rocket_",
  TRUE ~ str_replace_all(str_to_lower(favorite_album), " ", "_")),
  #I don't know if there is a better way to do below, because I can't use favorite_album_csv to subset
  fav_song_fav_album = case_when(
    favorite_song %in% album_songs$house_of_sugar & favorite_album_csv == "house_of_sugar" ~ T,
    favorite_song %in% album_songs$rocket_ & favorite_album_csv == "rocket_" ~ T,
    favorite_song %in% album_songs$beach_music & favorite_album_csv == "beach_music" ~ T,
    favorite_song %in% album_songs$dsu & favorite_album_csv == "dsu" ~ T,
    favorite_song %in% album_songs$trick & favorite_album_csv == "trick" ~ T,
    favorite_song %in% album_songs$rules & favorite_album_csv == "rules" ~ T,
    favorite_song %in% album_songs$easy & favorite_album_csv == "easy" ~ T,
    favorite_song %in% album_songs$winner & favorite_album_csv == "winner" ~ T,
    favorite_song %in% album_songs$race & favorite_album_csv == "race" ~ T,
    T ~ F
    )
  ) %>% filter(favorite_song %in% as.vector(unlist(album_songs)), favorite_song != "")




table(songs_dat_test$fav_song_fav_album)
## 
## FALSE  TRUE 
##   112    64
favorite_song_chi <- chisq.test(table(songs_dat_test$fav_song_fav_album))

favorite_song_chi
## 
##  Chi-squared test for given probabilities
## 
## data:  table(songs_dat_test$fav_song_fav_album)
## X-squared = 13.091, df = 1, p-value = 0.0002967
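For reference, the 50/50 split the null hypothesis expects can be pulled straight out of the test object; a quick sketch using the counts printed above:

```r
# Counts reported above: 112 FALSE, 64 TRUE (176 usable answers)
fav_tab <- c("FALSE" = 112, "TRUE" = 64)
res <- chisq.test(fav_tab)  # goodness of fit vs an even split
res$expected                # 88 and 88: half of 176 in each cell
```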

It is more likely that your favorite song will not be on your favorite album, statistically speaking. Frankly, if you compare favorite songs to favorite albums, I’m just gonna blame Gnaw for this one.

The spotifyr package

I would first like to thank a friend of mine, Stephanie Y., who looked over this markdown for me and suggested this package.

You might be wondering, “what is spotifyr?”, and to put it simply, it is an R package that lets me use Spotify’s API to get data about Alex G and his music. While I could do a lot with this data, what I’m most interested in is the different “qualities” that Spotify attributes to songs. These are machine-learning-derived measures such as “valence” (sad to happy; 0 to 1), danceability (0 to 1), energy (0 to 1), etc. I’m going to make tables showing the overall top and bottom songs for each of these categories, and honestly, I’m dubious as hell about them. For example, there is a quality known as “liveness” which estimates whether the song was recorded live. The number 1 “liveness” song is Brick, and Sugar House - LIVE is third! We also have loudness in decibels, which goes from Clouds (thanks Luke) to Brick. If you want to know more click here; it’s interesting anyways.

note: All scales are from 0-1 except loudness, which is in decibels.

note 2: PUT RACE ON SPOTIFY PLEASE
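Side note: if you ever wanted loudness on the same 0-1 footing as the other qualities, scales::rescale maps a numeric range onto [0, 1]. A small sketch with made-up decibel values (not Alex’s actual tracks):

```r
library(scales)

# Hypothetical decibel values; rescale() maps min -> 0 and max -> 1
db <- c(-24, -12, -6, -1)
rescale(db)
# [1] 0.0000000 0.5217391 0.7826087 1.0000000
```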

spot_cred <- read.csv("inputs/spotifyr_dat.csv", fileEncoding = "UTF-8-BOM")

Sys.setenv(SPOTIFY_CLIENT_ID = spot_cred$client_id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = spot_cred$shhh)
access_token <- get_spotify_access_token()

Generating tables with the top 5 and bottom 5 tracks for a select handful of Spotify qualities.

alex_g <- get_artist_audio_features('alex g')
#Remove duplicate tracks
alex_g <- alex_g %>% select(-album_id, -track_id, -analysis_url, - available_markets,
                                   -track_href, -track_preview_url, -track_uri, -external_urls.spotify,
                                   -album_images) %>% distinct() %>% filter(!str_detect(track_name, "onus"))

top_bot_n <- function(df, col, n = 10){
  col <- enquo(col)
  top <- df %>% arrange(desc(!!col)) %>% slice_head(n = n)
  bot <- df %>% arrange(desc(!!col)) %>% slice_tail(n = n)
  return(rbind(top,bot))
}
energy_tab <- top_bot_n(alex_g, energy, n = 5) %>% select(`Energy Song Name` = track_name,Energy = energy)
danceability_tab <- top_bot_n(alex_g, danceability, n = 5) %>% select(`Danceability Song Name` = track_name, Danceability = danceability)
valence_tab <- top_bot_n(alex_g, valence, n = 5) %>% select(`Valence Song Name` = track_name, Valence = valence)
acousticness_tab <- top_bot_n(alex_g, acousticness, n = 5) %>% select(`Acousticness Song Name` = track_name, Acousticness = acousticness)
liveness_tab <- top_bot_n(alex_g, liveness, n = 5) %>% select(`Liveness Song Name` = track_name, Liveness = liveness)
loudness_tab <- top_bot_n(alex_g, loudness, n = 5) %>% select(`Loudness Song Name` = track_name, Loudness = loudness)
ggtexttable(cbind(energy_tab, valence_tab))

ggtexttable(cbind(danceability_tab, loudness_tab))

Now that you’ve probably drawn your own conclusions about the validity of this data (reminder: these are the top 5 and bottom 5 songs per song quality), let’s look over Alex’s albums, and then we’ll see how this intersects with our music.

Alex G albums as an emotional rollercoaster

Technically not the best way to represent the data but…..

##  Add plotting colors
alb_df <-  data.frame(factor(c("House of Sugar", "Rocket", "Beach Music", "Dsu", "Trick", "Rules")),
            c("darkorchid4", "red3", "royalblue4", "springgreen4", "deepskyblue1", "black"))
colnames(alb_df) <- c("album_names", "album_color")


alex_g$album_name <- factor(alex_g$album_name, alb_df$album_names)

ggplot(alex_g, aes(x = valence, y = ..scaled.., fill = album_name)) +
  geom_density() +
  facet_grid(album_name~.) +
  scale_fill_manual(values = alb_df$album_color) +
  theme_classic() +
  theme(legend.position = "none") +
  ylab("Scaled Density") +
  xlab("Valence (0 = sad, 1 = happy)") +
  labs(title = "Valence as an emotional rollercoaster for each album")
ggplot(alex_g, aes(x = track_number, y= valence, fill = album_name)) +
  geom_bar(stat = "identity") +
  facet_grid(album_name~.) +
  scale_fill_manual(values = alb_df$album_color) +
  scale_x_continuous(breaks = 1:16)+
  theme_classic() +
  ylab("Valence (0-1)") +
  xlab("Track Number") +
  theme(legend.position = "none") +
  labs(title = "Valence barplot for each Alex G album")

Hearing loss due to Alex G albums (loudness)

SHUT UP SHUT UP

note: Remember, the value is in decibels; the closer to 0, the louder it is

ggplot(alex_g, aes(x = track_number, y= loudness, fill = album_name)) +
  geom_bar(stat = "identity") +
  facet_grid(album_name~.) +
  scale_fill_manual(values = alb_df$album_color) +
  scale_x_continuous(breaks = 1:16) +
  theme_classic() +
  ylab("Loudness (Decibels)") +
  xlab("Track Number") +
  theme(legend.position = "none")+
  labs(title = "Loudness barplot for each Alex G album")

I find the overall increase in volume over his career interesting for two reasons. First, Alex mixed and mastered his own songs until Beach Music, which was engineered by someone at Domino Records, and the Domino albums seem to have a more consistent volume level. Second, it is known that over time music has only gotten louder, and that seems to be true for Alex as well. Actually, let us test this real quick.

Has Alex gotten louder over time?

alex_g_avg <- alex_g %>% group_by(album_name) %>% summarise(album_release_year = mean(album_release_year), avg_loudness = mean(loudness))

summary(lm(loudness ~ album_release_year, data = alex_g))
## 
## Call:
## lm(formula = loudness ~ album_release_year, data = alex_g)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2013  -1.5312   0.3082   1.9228   6.1427 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1058.7570   259.5210  -4.080 0.000106 ***
## album_release_year     0.5212     0.1288   4.046 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.974 on 80 degrees of freedom
## Multiple R-squared:  0.1699, Adjusted R-squared:  0.1595 
## F-statistic: 16.37 on 1 and 80 DF,  p-value: 0.0001194
ggplot(alex_g, aes(x = album_release_year, y = loudness, fill = album_name)) +
  geom_boxplot() +
  scale_fill_manual(values = alb_df$album_color) +
  theme_classic() +
  labs(title = "Has Alex gotten louder over time?",
       fill = "Albums")

So, to an extent, yes, the theory is correct: his music seems to get louder as time passes, and Rules and Trick at least had a relatively larger spread of volume levels compared to his later work.
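Back-of-the-envelope on the regression: the album_release_year coefficient of ~0.52 means roughly half a decibel gained per year, so across about nine years of releases:

```r
slope <- 0.5212  # album_release_year coefficient from the summary above
slope * 9        # approximate loudness gain over ~9 years of releases
# [1] 4.6908
```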

Energy

ggplot(alex_g, aes(x = track_number, y= energy, fill = album_name)) +
  geom_bar(stat = "identity") +
  facet_grid(album_name~.) +
  scale_fill_manual(values = alb_df$album_color) +
  scale_x_continuous(breaks = 1:16) +
  theme_classic() +
  ylab("Energy (0-1)") +
  xlab("Track Number") +
  theme(legend.position = "none") +
  labs(title = "Energy barplot for each Alex G album")

Danceability

ggplot(alex_g, aes(x = track_number, y= danceability, fill = album_name)) +
  geom_bar(stat = "identity") +
  facet_grid(album_name~.) +
  scale_fill_manual(values = alb_df$album_color) +
  scale_x_continuous(breaks = 1:16) +
  theme_classic() +
  ylab("Danceability (0-1)") +
  xlab("Track Number") +
  theme(legend.position = "none") +
  labs(title = "Danceability barplot for each Alex G album")

Do Alex G members have a musical preference?

Is there a secret formula to Alex G’s top hits? Maybe we can use this data to figure it out! By intersecting the different Spotify qualities with 666posting members’ favorite songs, we can see if the distribution of favorites differs from Alex’s discography as a whole. Remember though, this excludes non-officially-released tracks.

songs_only <- songs_dat %>% select(Timestamp, track_name = favorite_song) %>% mutate(track_name = case_when(
  track_name == "Sugar House" ~ "SugarHouse - Live",
  T ~ track_name
))

filtered_songs <- left_join(songs_only,alex_g) %>% drop_na() %>% select(track_name, danceability, energy, valence) %>% mutate(source = "Fans")
filtered_alex_g <- alex_g %>% select(track_name, danceability, energy, valence) %>% mutate(source = "Discography")
compare_fans <- rbind(filtered_songs, filtered_alex_g)
melt_fans <- melt(compare_fans)

ggplot(melt_fans, aes(x = value, ..scaled.., fill = source)) + geom_density(alpha = .5) +
  facet_grid(variable~.) +
  theme_classic() +
  labs(title = "Comparing Spotify song qualities of Alex's discography to fan favorite songs",
       subtitle = paste0("N = ",nrow(filtered_songs)),
       fill = "Fan favorites or
total discography") +
  ylab("Density") +
  xlab("Value of Quality") 

ks.test(filtered_songs$danceability,filtered_alex_g$danceability)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  filtered_songs$danceability and filtered_alex_g$danceability
## D = 0.1733, p-value = 0.08157
## alternative hypothesis: two-sided
ks.test(filtered_songs$energy,filtered_alex_g$energy)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  filtered_songs$energy and filtered_alex_g$energy
## D = 0.084243, p-value = 0.8441
## alternative hypothesis: two-sided
ks.test(filtered_songs$valence,filtered_alex_g$valence)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  filtered_songs$valence and filtered_alex_g$valence
## D = 0.076701, p-value = 0.9127
## alternative hypothesis: two-sided

Well, these are interesting distributions. Alex has a somewhat normal curve for danceability, but there seems to be a bimodal distribution amongst fan favorite songs; as a result, it does lean more towards a danceability preference. With energy, people prefer lower-energy songs, with fan preference taking a dip around 0.8. Lastly, with valence, which I will remind you is basically happiness, there is a low peak around .30 of sad peeps with some happy nerds around .75, breaking through the discography distribution. Statistically speaking, using Kolmogorov-Smirnov tests, which test whether two samples come from the same distribution, we see that there is no significant difference, but that bimodal danceability distribution trends towards significance.
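For intuition on what the KS test is picking up, a quick simulated sketch (toy data, not the survey):

```r
set.seed(1)
a <- rnorm(100)                  # baseline sample
b <- rnorm(100)                  # drawn from the same distribution as a
c_shift <- rnorm(100, mean = 2)  # clearly shifted distribution

ks.test(a, b)$p.value        # large: no evidence the distributions differ
ks.test(a, c_shift)$p.value  # tiny: the distributions clearly differ
```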

Personality Questions

##Pull personality data
pers_dat <- dat %>% select(Timestamp, mbti, hp_house, sign, weed_use, alc_use) 


#Figure size
hght <- 7.29
wdth <- 4.5

Myers-Briggs

Myers-Briggs is the zeitgeist when it comes to personality tests. For more information on them check out this website: https://www.16personalities.com/free-personality-test . While there are sixteen personalities, they are not evenly distributed in the population: according to data on the official MBTI types, some make up over 10% of the population while others are closer to 2%. These types can also be broken down further into their more discrete letter types, Extrovert vs Introvert (E v I) for example. I have two figures below showing both of these distributions in the general population. We will then compare them to the Alex G data.

General Population

Not really sure what the best question to ask here is besides the distribution itself and how it differs from the general population. Data pulled from here

#Reduced the percentage of ISFJ, ESFJ, and ISTJ because everywhere I look the total percent is over 100
mb_types = c( .137,
              .122,
              .115,
              .088,
              .087,
              .085,
              .081,
              .054,
              .044,
              .043,
              .033,
              .032,
              .025,
              .021,
              .018,
              .015)
  
mb_names = c("ISFJ",                  
            "ESFJ",                 
            "ISTJ",                 
            "ISFP",             
            "ESTJ",             
            "ESFP",             
            "ENFP",             
            "ISTP",         
            "INFP",         
            "ESTP",         
            "INTP",         
            "ENTP",         
            "ENFJ",     
            "INTJ",    
            "ENTJ",
            "INFJ")

names(mb_types) = mb_names

mb_df <- as.data.frame(mb_types) %>%
  rownames_to_column(var = "mbtis") %>%
  rename(mbti_percent = mb_types) %>%
  mutate(IE = ifelse(str_detect(mbtis, "I"), "I", "E"),
         SN = ifelse(str_detect(mbtis, "S"), "S", "N"),
         FT = ifelse(str_detect(mbtis, "F"), "F", "T"),
         JP = ifelse(str_detect(mbtis, "J"), "J", "P"))

#Better way to do this?
EI <- mb_df %>% group_by(IE) %>% summarise(sum(mbti_percent))
SN <- mb_df %>% group_by(SN) %>% summarise(sum(mbti_percent))
FT <- mb_df %>% group_by(FT) %>% summarise(sum(mbti_percent))
JP <- mb_df %>% group_by(JP) %>% summarise(sum(mbti_percent))
colnames(EI) <- c("type", "percent type")
colnames(SN) <- c("type", "percent type")
colnames(FT) <- c("type", "percent type")
colnames(JP) <- c("type", "percent type")


letter_df <- rbind(EI,SN,FT,JP)

category <- factor(c("E or I", "E or I", "N or S", "N or S", "F or T", "F or T", "J or P", "J or P"),
                   c("E or I", "N or S", "F or T", "J or P"))

letter_df <-  cbind(letter_df, category)
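Re the “better way to do this?” comment above: one pivot can replace the four group_by/summarise calls. A sketch assuming tidyr is available, shown on a toy two-row frame so it stands alone:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for mb_df: two types with their population shares
demo <- data.frame(mbtis = c("ISFJ", "ENTP"),
                   mbti_percent = c(.137, .032)) %>%
  mutate(IE = ifelse(grepl("I", mbtis), "I", "E"),
         SN = ifelse(grepl("S", mbtis), "S", "N"),
         FT = ifelse(grepl("F", mbtis), "F", "T"),
         JP = ifelse(grepl("J", mbtis), "J", "P"))

# One pivot + one summarise instead of four group_by blocks
letter_demo <- demo %>%
  pivot_longer(c(IE, SN, FT, JP), names_to = "category", values_to = "type") %>%
  group_by(category, type) %>%
  summarise(`percent type` = sum(mbti_percent), .groups = "drop")
```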

General population plot of Myers-Briggs Types

#Factor levels to maintain look in plot
mb_df$mbtis <- factor(mb_df$mbtis, levels = mb_names)

gen_mbti <- ggplot(mb_df,
       aes(x = mbtis,
           y = mbti_percent,
           fill = mbtis)) +
  geom_col(color = "black") + 
  scale_y_continuous(labels = scales::percent) +
  ylab("Myer-Briggs Type Percent") +
  xlab("Myer-Briggs Type") 

gen_mbti +
  labs(title = "Population Myer-Briggs Types",
       subtitle = "N Unknown, from studies between 1972-2002 (pls update website)") +
  theme_classic()

gen_mbti_2 <- gen_mbti +
    labs(title = "Population Myer-Briggs Types",
         subtitle = "N Unknown") +
    theme_classic() + coord_flip()
ggsave(filename = "outputs/myer-brigg-gen-pop.png", width = hght, height = wdth)

General population letter types

letter_df$type <- factor(letter_df$type, c("E", "I", "N", "S", "F", "T", "J", "P"))

gen_lett <- ggplot(letter_df, aes(x = category, y = `percent type`, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_brewer(palette = "Set1", name = "Discrete Types", labels = c("Extrovert", "Introvert",
                                                          "Intuitives", "Sensors",
                                                          "Feeling", "Thinking",
                                                          "Judging", "Perceiving"
                                                          ))

gen_lett  +
  labs(title = "Population Myer-Briggs Single Letter Types",
       subtitle = "N Unknown, from studies between 1972-2002") +
  theme_classic() + scale_y_continuous(labels = scales::percent)

gen_lett_2 <- gen_lett +  
  labs(title = "Population Letter Types",
       subtitle = "N Unknown, studies between 1972-2002") +
  theme_classic() + scale_y_continuous(labels = scales::percent)

ggsave(filename = "outputs/myer-brigg-lett-gen-pop.png", width = hght, height = wdth)

Now the 666posting population

#### Prepare data
##Alex Data

mbti_dat <- as.data.frame(table(pers_dat$mbti), stringsAsFactors = F) %>%
                  filter(!(Var1 %in% c("",
                                       "Prefer not to answer",
                                       "I don't know"))) %>%
                  rename(mbtis = Var1, count = Freq) %>%
                  mutate(mbti_percent = count/sum(count)) 


#There's missing data because not every type showed up in the survey, so we gotta fill in some blanks
missing_dat <- data.frame(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]),
                          rep(0, length(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]))),
                          rep(0, length(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]))))

colnames(missing_dat) <- c("mbtis", "count", "mbti_percent")

mbti_dat <- rbind(mbti_dat, missing_dat)

mbti_dat$mbtis <- factor(mbti_dat$mbtis, levels = mb_names)
#### Single Letter Myer-Briggs Data
mbti_dat <- mbti_dat %>%
  mutate(IE = ifelse(str_detect(mbtis, "I"), "I", "E"),
         SN = ifelse(str_detect(mbtis, "S"), "S", "N"),
         FT = ifelse(str_detect(mbtis, "F"), "F", "T"),
         JP = ifelse(str_detect(mbtis, "J"), "J", "P"))


#Better way to do this?
EI_g <- mbti_dat %>% group_by(IE) %>% summarise(sum(count),sum(mbti_percent))
SN_g <- mbti_dat %>% group_by(SN) %>% summarise(sum(count),sum(mbti_percent))
FT_g <- mbti_dat %>% group_by(FT) %>% summarise(sum(count),sum(mbti_percent))
JP_g <- mbti_dat %>% group_by(JP) %>% summarise(sum(count),sum(mbti_percent))
colnames(EI_g) <- c("type", "count type", "percent type")
colnames(SN_g) <- c("type", "count type", "percent type")
colnames(FT_g) <- c("type", "count type", "percent type")
colnames(JP_g) <- c("type", "count type", "percent type")


letter_df_g <- rbind(EI_g,SN_g,FT_g,JP_g)

category <- factor(c("E or I", "E or I", "N or S", "N or S", "F or T", "F or T", "J or P", "J or P"),
                   c("E or I", "N or S", "F or T", "J or P"))

letter_df_g <-  cbind(letter_df_g, category)

Run the Chi-square analysis against general population probabilities

##Running the chi-square 

#mbti_dat$mbtis == mb_df$mbtis

mbti_dat <- mbti_dat[match(mb_df$mbtis, mbti_dat$mbtis),]

#mbti_dat$mbtis == mb_df$mbtis

mbti_chisq <- chisq.test(mbti_dat$count, p = mb_df$mbti_percent, simulate.p.value = T)


#Can't lump all letters so here we go slow again
EI_chisq <- chisq.test(letter_df_g$`count type`[1:2],
                       p = letter_df$`percent type`[1:2],
                       simulate.p.value = T)

NS_chisq <- chisq.test(letter_df_g$`count type`[3:4],
                       p = letter_df$`percent type`[3:4],
                       simulate.p.value = T)

FT_chisq <- chisq.test(letter_df_g$`count type`[5:6],
                       p = letter_df$`percent type`[5:6],
                       simulate.p.value = T)

JP_chisq <- chisq.test(letter_df_g$`count type`[7:8],
                       p = letter_df$`percent type`[7:8],
                       simulate.p.value = T)

pvals_lett <- format.pval(c(EI_chisq$p.value, NS_chisq$p.value, FT_chisq$p.value, JP_chisq$p.value), 3)

Plotting MBTI Types

##Plotting the data
surv_mbti <- ggplot(mbti_dat,
       aes(x = mbtis,
           y = count,
           fill = mbtis)) +
  geom_col(color = "black") +
  xlab("Myers–Briggs Type") +
  ylab("Count")  +
  theme_classic() 
surv_mbti +
  labs(title = "666posting Myer-Briggs Types",
       subtitle = paste0("N = ",
                         sum(mbti_dat$count),
                         " ; Chi-Square compared to population p-value = ",
                         format.pval(mbti_chisq$p.value, digits = 3)))

ggsave(filename = "outputs/myer-brigg-666.png", width = hght, height = wdth)

surv_mbti_2 <- surv_mbti +
                labs(title = "666posting Myer-Briggs Types",
                     subtitle = paste0("N = ",
                                       sum(mbti_dat$count),
                                       " ; Chi-Square p-value = ",
                                       format.pval(mbti_chisq$p.value, digits = 3)))  + coord_flip()

Plotting Single MBTI Letter Types

letter_df_g$type <- factor(letter_df_g$type, c("E", "I", "N", "S", "F", "T", "J", "P"))

surv_lett <- ggplot(letter_df_g, aes(x = category, y = `percent type`, fill = type)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_brewer(palette = "Set1", name = "Discrete Types",
                    labels = c("Extrovert", "Introvert",
                                "Intuitives", "Sensors",
                                "Feeling", "Thinking",
                                "Judging", "Perceiving")) +
  theme_classic() + scale_y_continuous(labels = scales::percent)
  
surv_lett  +
  labs(title = "666posting Myers-Briggs Single Letter Types",
       subtitle = paste("P-values per letter group in respective order:",
                        paste(pvals_lett, collapse = ' ; '))) 

surv_lett_2 <- surv_lett  +
  labs(title = "666posting Letter Types",
       subtitle = paste("P-values",
                        paste(pvals_lett, collapse = ' ; '))) 


ggsave(filename = "outputs/myer-brigg-lett-666.png", width = hght, height = wdth)

MBTI Analysis

The group, compared to the general population, well, they look NOTHING alike. Granted, the population data is relatively outdated (cough cough, it hasn't been updated since 2002), but our distribution isn't close to it at all; below I'll put the outputs side by side. A good portion of our members come from the rarer types: 18% of participating members are the rarest type, INFJ, which makes up only ~2% of the general population. Crazy. This, of course, carries over to the discrete letter types as well. The starkest difference is that the Intuitives considerably outnumber the Sensors in this group (I don't know what that means, but if you want to explain it in the comments, feel free to), which is the opposite of the general population. It's wacky, folks.

gpt_mbti <- ggarrange(gen_mbti_2, surv_mbti_2,
                      ncol = 2, common.legend = T)


annotate_figure(gpt_mbti, top = text_grob(
  "Comparison of Myers-Briggs Types"
))

gpt_lett <- ggarrange(gen_lett_2, surv_lett_2,
                      ncol = 2, common.legend = T)


annotate_figure(gpt_lett, top = text_grob(
  "Comparison of Myers-Briggs Letters"
))

Note: The reason some p-values come out identical is that I have the chi-squares set to simulate.p.value = TRUE, since the sparse counts throw a warning with the asymptotic test otherwise.
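To see why the simulation is turned on: when expected counts get small, chisq.test warns that the chi-square approximation may be incorrect, and simulate.p.value = TRUE swaps in a Monte Carlo p-value instead. A toy sketch with made-up counts:

```r
# Made-up counts with sparse expected cells: the classical test warns,
# the simulated version does not.
counts <- c(18, 2, 1, 1)
probs  <- c(0.85, 0.05, 0.05, 0.05)

chisq.test(counts, p = probs)                          # warning: approximation may be incorrect
chisq.test(counts, p = probs, simulate.p.value = TRUE) # Monte Carlo p-value (B = 2000 by default)
```

One side effect: a simulated p-value can never drop below 1/(B+1) ≈ 0.0005, so several highly significant tests end up reporting the same floor value.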

Birthday/Zodiacs

Gonna be honest, I am underqualified to say anything about what zodiac signs mean, so I'm going to leave these results up to your interpretation. That said, one thing I do know is that the signs break up into the four classic elements (fire, water, air, and earth), which are included in my analysis. To generate the population data I pulled US birth records from 2000-2014 (the SSA data via FiveThirtyEight) as an estimate. Signs, and likewise elements, appear to be distributed relatively uniformly in the general population.

General Population

Pulling public data to make a fair general population example and then preparing it

us_birth <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv")

all_births = sum(us_birth$births)

us_birth_sum <- us_birth %>%
  mutate(sign = case_when(
  (month == 1 & date_of_month >= 20) | (month == 2 & date_of_month <= 18) ~ "Aquarius",
  (month == 2 & date_of_month >= 19) | (month == 3 & date_of_month <= 20) ~ "Pisces",
  (month == 3 & date_of_month >= 21) | (month == 4 & date_of_month <= 19) ~ "Aries",
  (month == 4 & date_of_month >= 20) | (month == 5 & date_of_month <= 20) ~ "Taurus",
  (month == 5 & date_of_month >= 21) | (month == 6 & date_of_month <= 20) ~ "Gemini",
  (month == 6 & date_of_month >= 21) | (month == 7 & date_of_month <= 22) ~ "Cancer",
  (month == 7 & date_of_month >= 23) | (month == 8 & date_of_month <= 22) ~ "Leo",
  (month == 8 & date_of_month >= 23) | (month == 9 & date_of_month <= 22) ~ "Virgo",
  (month == 9 & date_of_month >= 23) | (month == 10 & date_of_month <= 22) ~ "Libra",
  (month == 10 & date_of_month >= 23) | (month == 11 & date_of_month <= 21) ~ "Scorpio",
  (month == 11 & date_of_month >= 22) | (month == 12 & date_of_month <= 21) ~ "Sagittarius",
  (month == 12 & date_of_month >= 22) | (month == 1 & date_of_month <= 19) ~ "Capricorn"
  
)) %>% group_by(sign) %>%
  summarise(total_sign = sum(births)) %>%
  mutate(element = case_when(
    sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
    sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
    sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
    sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
  ),
  sign_percent = (total_sign/sum(us_birth$births)))

Making both colors and general population plots for zodiac signs

#Make element colors
elem_cols = c("whitesmoke", "forestgreen", "firebrick3","deepskyblue4")

##All signs of population
gen_sign <- ggplot(us_birth_sum, aes(x = sign, y = sign_percent, fill = element)) +
  geom_col(color = "black") +
  scale_fill_manual(values = elem_cols) + 
  scale_y_continuous(labels = scales::percent) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylab("Percent") +
  xlab("Sign") 

gen_sign +
  labs(title = "General Population Signs (births 2000-2014)",
       subtitle = paste0("N = ",
                         format(sum(us_birth_sum$total_sign),
                                big.mark = ",",
                                scientific = F)))

gen_sign_2 <- gen_sign +
              labs(title = "General Population Signs",
                   subtitle = paste0("N = ",
                                     format(sum(us_birth_sum$total_sign),
                                            big.mark = ",",
                                            scientific = F)))

ggsave(filename = "outputs/horo-signs-gen-pop.png", width = hght, height = wdth)

Also zodiac elements

#Element info
element_summ <- us_birth_sum %>%
                  group_by(element) %>%
                  summarise(count = sum(total_sign)) %>%
                  mutate(element_percent = count/sum(count))

#Element plot
gen_element<- ggplot(element_summ, aes(x = element, y = element_percent, fill = element)) +
                  geom_col(color = "black") +
                  scale_fill_manual(values = elem_cols) + 
                  scale_y_continuous(labels = scales::percent) +
                  theme_classic() +
                  ylab("Percent") + 
                  xlab("Elements") +
                   theme(legend.position = "None") 

gen_element + labs(title = "General Population Sign Elements (births 2000-2014)",
                       subtitle = paste0("N = ",
                                         format(sum(element_summ$count),
                                                big.mark = ",",
                                                scientific = F)))

gen_element_2 <- gen_element + labs(title = "General Population Sign Elements",
                       subtitle = paste0("N = ",
                                         format(sum(element_summ$count),
                                                big.mark = ",",
                                                scientific = F)))


ggsave(filename = "outputs/horo-elements-gen-pop.png", width = hght, height = wdth)

Preparing the Alex G data for the same analysis done above


#####Now for the Alex G group -----
#Make Zodiac table
g_Zodiac <- as.data.frame(table(pers_dat$sign), stringsAsFactors = F) %>%
                  filter(!(Var1 %in% c("", "Prefer not to answer"))) %>%
                  rename(sign = Var1, count = Freq) %>%
                  mutate(sign_percent = count/sum(count)) 

#Add sign elements
g_Zodiac <- g_Zodiac %>%
                mutate(element = case_when(
                  sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
                  sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
                  sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
                  sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
                  )
                )

#Separate element df
g_element <- g_Zodiac %>%
                  group_by(element) %>%
                  summarise(count = sum(count)) %>%
                  mutate(element_percent = count/sum(count))
#Chi-square analysis against normal population probabilities


##Chi-squares
#Make sure in correct order, should all be TRUE
#g_Zodiac$sign == us_birth_sum$sign

sign_chisq <- chisq.test(g_Zodiac$count, p = us_birth_sum$sign_percent)
#sign_chisq
#Not sig according to probabilities 

#Elements
#g_element$element == element_summ$element

element_chisq <- chisq.test(g_element$count, p = element_summ$element_percent)
#element_chisq

Basic Alex G zodiac sign plot

#Alex G Zodiac plot
surv_sign <- ggplot(g_Zodiac, aes(x = sign, y = sign_percent, fill = element)) +
        geom_col(color = "black") + 
        scale_fill_manual(values = elem_cols) +
        scale_y_continuous(labels = scales::percent) + 
        theme_classic() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
        ylab("Percent") +
        xlab("Sign")

surv_sign  +
        labs(title = "666posting Zodiac Signs",
             subtitle = paste0("N = ",
                               sum(g_Zodiac$count),
                               " ; Chi-Square compared to population p-value = ",
                               format.pval(sign_chisq$p.value, digits = 3)))

surv_sign_2 <- surv_sign +
        labs(title = "666posting Zodiac Signs",
             subtitle = paste0("N = ",
                               sum(g_Zodiac$count),
                               " ; Chi-Square p-value = ",
                               format.pval(sign_chisq$p.value, digits = 3)))

ggsave(filename = "outputs/horo-signs-666.png", width = hght, height = wdth)

Plotting the zodiac elements of the Alex G data

#Alex G element plot
surv_element <- ggplot(g_element, aes(x = element, y = element_percent, fill = element)) +
  geom_col(color = "black") +
  scale_fill_manual(values = elem_cols) + 
  scale_y_continuous(labels = scales::percent) +
  theme_classic() +
  ylab("Percent") +
  xlab("Elements") +
  theme(legend.position = "None") 

surv_element +
  labs(title = "666posting Zodiac Sign Elements",
       subtitle = paste0("N = ",
                         sum(g_element$count),
                         " ; Chi-Square compared to population p-value = ",
                         format.pval(element_chisq$p.value, digits = 3)))

surv_element_2 <- surv_element +
  labs(title = "666posting Zodiac Sign Elements",
       subtitle = paste0("N = ",
                         sum(g_element$count),
                         " ; Chi-Square p-value = ",
                         format.pval(element_chisq$p.value, digits = 3)))
  
ggsave(filename = "outputs/horo-elements-666.png", width = hght, height = wdth)

Astrology Analysis

In our group, compared to the general population, there does seem to be an imbalance: Capricorns make up around 3% of our group while Geminis make up ~12%. That said, this is not a significant change in the distribution according to a chi-square analysis. The same is true for elements: the change nears significance but does not cross the arbitrary (thanks, Fisher) 0.05 threshold, though overall there are fewer Earth and Water signs in this group.

gpt_sign <- ggarrange(gen_sign_2, surv_sign_2,
                      ncol = 2, common.legend = T)


annotate_figure(gpt_sign, top = text_grob(
  "Comparison of Astrological Signs"
))

gpt_element <- ggarrange(gen_element_2, surv_element_2,
                      ncol = 2, common.legend = T)


annotate_figure(gpt_element, top = text_grob(
  "Comparison of Astrological Elements"
))

Hogwarts House

General Population

I’m also not great with Harry Potter info, but this one was relatively interesting because there is some insight to be gained. I had to eyeball the general population percentages from the article below because they didn’t actually give the true numbers. What’s interesting is that the main difference between the general population and 666posting is that we have more Slytherins. What makes this even more interesting is that, according to the Time article, Slytherin makes up a large proportion of the younger population, and I’d say this group leans to the younger side (see the age distributions above). So I’d argue this cohort doesn’t stray from the normal population distribution once adjusted for age. But we don’t have enough older people to properly adjust for age, so just take my word for it, K? Great!

Data taken from Time Magazine (https://drive.google.com/file/d/0B8PCmhQmtcDKLXlzSGtnZ0hKbjQ/view); they don’t have the actual percents posted, but it’s a much larger sample size.

#My best guess from eyeing The Times
house_percent <- c(.15, .30, .47, .08)
house <- c("Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin")
house_colors <- c("firebrick", "goldenrod3", "dodgerblue4", "darkgreen")
america_house <- data.frame(house, house_percent, stringsAsFactors = F) 

Re-plot the Times Data

gen_house <- ggplot(america_house,
           aes(x = house, y = house_percent, fill = house)) +
      geom_col(color = "black") +
      scale_fill_manual(values = house_colors) + 
      scale_y_continuous(labels = scales::percent) +
      xlab("Hogwarts Houses") +
      theme_classic() +
       theme(legend.position = "None") 


gen_house +
      labs(title = "US Hogwarts House Distribution (Time Magazine's Data)",
           subtitle ="N = ~500,000")

gen_house_2 <- gen_house +
      labs(title = "US Hogwarts House Distribution",
           subtitle ="N = ~500,000")


ggsave(filename = "outputs/hogwarts-gen-pop.png")

Preparing and cleaning Alex G data

Now that we have the general population, prepare the 666posting data

#Should turn this into a function by this point
house_dat <- as.data.frame(table(pers_dat$hp_house), stringsAsFactors = F) %>%
                filter(!(Var1 %in% c("",
                                     "Prefer not to answer",
                                     "I don't know"))) %>%
                rename(house = Var1, count = Freq) %>%
                mutate(house_percent = count/sum(count))
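As the comment above says, this tally-filter-percent pattern has now shown up three times (zodiac signs, MBTI, houses). A minimal helper sketch that would cover all three (the function name and arguments are mine, not from the original code):

```r
# Hypothetical helper: tally a survey column, drop non-answers, add percents
count_percent <- function(x,
                          drop = c("", "Prefer not to answer", "I don't know"),
                          name = "category") {
  df <- as.data.frame(table(x), stringsAsFactors = FALSE)
  names(df) <- c(name, "count")
  df <- df[!(df[[name]] %in% drop), ]
  df$percent <- df$count / sum(df$count)
  df
}

# e.g. the house table could then be built as:
# house_dat <- count_percent(pers_dat$hp_house, name = "house")
```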

Chi-Square for houses

#Sig difference
house_chisq <- chisq.test(house_dat$count, p = house_percent)
house_chisq
## 
##  Chi-squared test for given probabilities
## 
## data:  house_dat$count
## X-squared = 26.732, df = 3, p-value = 6.699e-06

Plotting Alex G Hogwarts Houses

###Plot house 
surv_house <- ggplot(house_dat,
             aes(x = house, y = count, fill = house)) +
        geom_col(color = "black") +
        scale_fill_manual(values = house_colors)  +
        xlab("Hogwarts Houses")  +
        theme_classic() +
         theme(legend.position = "None") 

surv_house + 
        labs(title = "666posting Hogwarts Houses",
             subtitle = paste0("N = ",
                               sum(house_dat$count),
                               " ; Chi-Square compared to population p-value = ",
                               format.pval(house_chisq$p.value, digits = 3)))

surv_house_2 <- surv_house +
        labs(title = "666posting Hogwarts Houses",
             subtitle = paste0("N = ",
                               sum(house_dat$count),
                               " ; Chi-Square p-value = ",
                               format.pval(house_chisq$p.value, digits = 3)))

  ggsave(filename = "outputs/hogwarts-666.png", width = hght, height = wdth)

As I said above, the only thing of note is the change in Slytherin, which I attribute to the trend that Slytherins are generally younger, and this group is a younger sample compared to the overall population.

gpt_house <- ggarrange(gen_house_2, surv_house_2,
                      ncol = 2, common.legend = T)


annotate_figure(gpt_house, top = text_grob(
  "Comparison of Harry Potter Houses"
))

Are there correlations between types?

My last question, and the one I found most interesting, is: are personality types related to one another? E.g., are certain MBTI types enriched in a given zodiac sign? Long story short: kinda, kinda not. I made a bunch of heatmap tables using an R package called ztable and calculated chi-squares, and while there were a few hits, I assure you none would stand up to any multiple-testing correction, so I don’t have a larger statement to make here. I feel we are underpowered to test this hypothesis, as not everyone answered all three survey questions, and even if they had, we might still need a larger N.
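On the multiple-testing point, the check is one line in R. Plugging in the nominal chi-square p-values reported in the heatmap tables below, the smallest Benjamini-Hochberg-adjusted value lands above 0.05:

```r
# Nominal chi-square p-values from the heatmap tables in this section
pvals <- c(0.081, 0.524, 0.968, 0.786, 0.343,
           0.178, 0.137, 0.005, 0.33,
           0.49, 0.853, 0.0115, 0.548,
           0.128, 0.128, 0.268, 0.776)

# Benjamini-Hochberg adjustment: the best "hit" becomes 0.005 * 17 = 0.085
min(p.adjust(pvals, method = "BH"))
```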

note: I’m going to not echo these as the code is repetitive.

Using heatmap tables from the ztable package

Ztable signs vs MBTI

hor_mbt_dat <- pers_dat %>% filter(!(mbti %in% c("", "Prefer not to answer", "I don't know")),
                                   !(sign %in% c("", "Prefer not to answer", "I don't know"))) %>%
                mutate(element = case_when(
                  sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
                  sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
                  sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
                  sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
                  )
                )

#Sign
sign_tab <- table(hor_mbt_dat$mbti, hor_mbt_dat$sign)

sign_z <- ztable(sign_tab)

chi_sign_mbti <- chisq.test(hor_mbt_dat$mbti, hor_mbt_dat$sign, simulate.p.value = T)

#No way to format this nicely
sign_z %>% makeHeatmap() %>%
  print(caption = paste0("Heatmap Table of Zodiac Signs and Myers-Briggs Types; Chi-Square p-value of ", format.pval(chi_sign_mbti$p.value, digits = 3)))
Heatmap Table of Zodiac Signs and Myers-Briggs Types; Chi-Square p-value of 0.081
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
ENFJ 2 2 1 0 0 0 2 0 2 1 0 1
ENFP 2 5 5 1 4 3 3 2 4 1 2 2
ENTJ 0 0 0 1 0 0 0 0 0 1 0 0
ENTP 0 0 2 0 0 1 1 0 0 1 0 1
ESFP 0 0 0 0 1 1 0 0 0 0 1 0
INFJ 1 1 2 0 4 5 2 2 6 4 2 4
INFP 10 6 1 4 1 5 5 4 7 0 4 4
INTJ 0 0 0 1 1 0 3 3 2 1 0 3
INTP 1 0 0 2 7 1 1 3 0 1 1 3
ISFJ 0 0 0 0 0 1 0 0 0 0 0 0
ISFP 0 0 0 0 1 0 0 0 0 1 1 0
ISTP 0 0 0 0 1 0 1 0 0 0 0 0
Ztable Zodiac Elements vs MBTI
Heatmap Table of Zodiac Elements and Myers-Briggs Types; Chi-Square p-value of 0.524
  ENFJ ENFP ENTJ ENTP ESFP INFJ INFP INTJ INTP ISFJ ISFP ISTP
Air 4 9 0 1 1 7 16 4 9 0 1 2
Earth 1 5 1 1 1 6 12 4 6 0 1 0
Fire 4 12 0 1 1 12 18 2 1 1 0 0
Water 2 8 1 3 0 8 5 4 4 0 1 0

Ztable for Zodiac Signs and HP Houses

Heatmap Table of Zodiac Signs and Hogwarts Houses; Chi-Square p-value of 0.968
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
Gryffindor 2 1 1 1 2 2 2 1 4 0 1 2
Hufflepuff 4 3 1 0 3 5 2 2 7 3 4 4
Ravenclaw 3 4 1 1 9 4 5 2 4 2 3 4
Slytherin 4 0 2 1 2 2 3 0 3 4 1 3
Ztable for Zodiac Elements and HP Houses
Heatmap Table of Zodiac Elements and Hogwarts Houses; Chi-Square p-value of 0.786
  Air Earth Fire Water
Gryffindor 6 4 7 2
Hufflepuff 9 8 15 6
Ravenclaw 17 8 12 5
Slytherin 9 5 5 6
MBTI vs Houses
Heatmap Table of Hogwarts Houses and Myers-Briggs Types; Chi-Square p-value of 0.343
  Gryffindor Hufflepuff Ravenclaw Slytherin
ENFJ 3 1 3 1
ENFP 6 6 8 7
ENTJ 0 0 0 1
ENTP 0 0 2 0
INFJ 2 11 7 4
INFP 3 13 9 6
INTJ 3 1 3 1
INTP 1 1 4 4
ISFJ 0 1 0 0
ISFP 0 1 0 0
ISTP 0 1 0 0

MBTI letters vs Zodiac Signs

Prepare data

letter_dat <- pers_dat %>% filter(!(mbti %in% c("", "Prefer not to answer", "I don't know")),
                    !(sign %in% c("", "Prefer not to answer", "I don't know"))) %>%
              mutate(IE = ifelse(str_detect(mbti, "I"), "I", "E"),
                     SN = ifelse(str_detect(mbti, "S"), "S", "N"),
                     FT = ifelse(str_detect(mbti, "F"), "F", "T"),
                     JP = ifelse(str_detect(mbti, "J"), "J", "P"))  %>%
                mutate(element = case_when(
                  sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
                  sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
                  sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
                  sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
                  )
                )
Sign and IE
Heatmap Table of Zodiac Signs and Intro vs Extra ; Chi-Square p-value of 0.178
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
E 4 7 8 2 5 5 6 2 6 4 3 4
I 12 7 3 7 15 12 12 12 15 7 8 14
Sign and SN
Heatmap Table of Zodiac Signs and Sensing vs Intuition ; Chi-Square p-value of 0.137
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
N 16 14 11 9 17 15 17 14 21 10 9 18
S 0 0 0 0 3 2 1 0 0 1 2 0
Sign and FT
Heatmap Table of Zodiac Signs and Feeling vs Thinking ; Chi-Square p-value of 0.005
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
F 15 14 9 5 11 15 12 8 19 7 10 11
T 1 0 2 4 9 2 6 6 2 4 1 7
Sign and JP
Heatmap Table of Zodiac Signs and Judge vs Perceive ; Chi-Square p-value of 0.33
  Aquarius Aries Cancer Capricorn Gemini Leo Libra Pisces Sagittarius Scorpio Taurus Virgo
J 3 3 3 2 5 6 7 5 10 7 2 8
P 13 11 8 7 15 11 11 9 11 4 9 10

MBTI letters vs Zodiac Elements

Element EI
Heatmap Table of Zodiac Elements and Intro vs Extra ; Chi-Square p-value of 0.49
  Air Earth Fire Water
E 15 9 18 14
I 39 29 34 22
Element SN
Heatmap Table of Zodiac Elements and Sensing vs Intuition ; Chi-Square p-value of 0.853
  Air Earth Fire Water
N 50 36 50 35
S 4 2 2 1
Element FT
Heatmap Table of Zodiac Elements and Feeling vs Thinking ; Chi-Square p-value of 0.0115
  Air Earth Fire Water
F 38 26 48 24
T 16 12 4 12
Element JP
Heatmap Table of Zodiac Elements and Judge vs Perceive ; Chi-Square p-value of 0.548
  Air Earth Fire Water
J 15 12 19 15
P 39 26 33 21

Hogwarts MBTI Letter Analysis

Prepare data

House IE
Heatmap Table of Hogwarts Houses and Intro vs Extra ; Chi-Square p-value of 0.128
  Gryffindor Hufflepuff Ravenclaw Slytherin
E 9 7 13 9
I 9 29 23 15
House SN
Heatmap Table of Hogwarts Houses and Sensing vs Intuition ; Chi-Square p-value of 0.128
  Gryffindor Hufflepuff Ravenclaw Slytherin
N 18 33 36 24
S 0 3 0 0
House FT
Heatmap Table of Hogwarts Houses and Feeling vs Thinking ; Chi-Square p-value of 0.268
  Gryffindor Hufflepuff Ravenclaw Slytherin
F 14 33 27 18
T 4 3 9 6
House JP
Heatmap Table of Hogwarts Houses and Judge vs Perceive ; Chi-Square p-value of 0.776
  Gryffindor Hufflepuff Ravenclaw Slytherin
J 8 14 13 7
P 10 22 23 17

Habits and Entertainment

Step one, per usual, is cleaning up the data for our specific purposes. The main changes are shortening some answer names and adding a fun variable about alcohol consumption in the USA.

dat_habits <- dat %>%
          select(id, age, gender, weed_use, alc_use, concerts, country, state) %>%
          mutate(weed_use = case_when(
                    weed_use == "I used to smoke weed, but now I don't" ~ "I don't smoke weed anymore",
                    T ~ weed_use
                  ),
                 alc_use = case_when(
                    alc_use == "I used to drink, but don't anymore" ~ "I don't drink anymore",
                    T ~ alc_use
                 ),
                 `Drinks in the USA` = ifelse(country == "United States" & 
                                           alc_use != "I never had alcohol", "Yes", "No"),
                 underage_drinking = ifelse(`Drinks in the USA` == "Yes" & age <= 20,
                                            "Yes",
                                            "No"))

Substance Usage

Let us first make some barplots of both drugs, followed by a ztable, to see what our distribution of recorded substance usage is. I will also run a chi-square (because the function already does that) to see if the joint distribution of usage is unexpected; i.e., are weed use and alcohol use independent, or are they correlated in some manner? I am also ordering the factors to make the tables and following graphs more readable.

dat_drugs <- dat_habits  %>% filter(alc_use != "", weed_use != "")

dat_drugs$alc_use <- factor(dat_drugs$alc_use, c("I never had alcohol",
                                                 "I don't drink anymore",
                                                 "Once a month/socially",
                                                 "Once a week",
                                                 "Multiple times per week",
                                                 "Every day",
                                                 "Multiple times a day"))

dat_drugs$weed_use <- factor(dat_drugs$weed_use, c("I have never smoked weed",
                                                 "I don't smoke weed anymore",
                                                 "Once a month",
                                                 "Once a week",
                                                 "Multiple times per week",
                                                 "Every day",
                                                 "Multiple times a day")) 

Bringing in data about weed legality around the world and across the states

world_weed <- read.csv("inputs/weed_world_legality.csv", stringsAsFactors = F, encoding = "UTF-8-BOM")
colnames(world_weed)[1:3] <- c("country",
                          "recreational",
                          "medical")
world_weed <- mutate(world_weed, weed_laws = case_when(
  recreational == "Legal" ~ "recreational",
  medical == "Illegal" ~ "not legal",
  str_detect(medical, "Legal") ~ "medical/complicated",
  TRUE ~ "decriminalized/complicated"
))

state_weed <- read.csv("inputs/state_marijuana_laws_10_2016.csv", stringsAsFactors = F)
colnames(state_weed) <- c("state", "medical", "recreational", "no_laws")
state_weed <- mutate(state_weed, weed_laws = case_when(
  no_laws == "Yes" ~ "not legal",
  recreational == "Yes" ~ "recreational",
  medical == "Yes" ~ "medical"
))

Alcohol and weed barplots

alc_count <- dat_drugs %>% count(alc_use)
weed_count  <- dat_drugs %>% count(weed_use)

ggplot(alc_count, aes(x=alc_use, y = n, fill=alc_use)) +
  geom_col(color = "black") +
  coord_flip() +
  scale_fill_viridis_d(option = "A", direction = -1)  +
  theme_classic() +
  theme(legend.position = "none") +
  geom_text(aes(label = n), nudge_y = 3, size=4) +
  labs(title = "Alcohol Usage in 666posting",
       subtitle = paste("N =", sum(alc_count$n))) +
  labs(fill = "Alcohol Usage") +
  xlab("Alcohol Usage") +
  ylab("Count of Individuals")

ggsave("outputs/alcohol_usage_barplot.png")

ggplot(weed_count, aes(x=weed_use, y = n, fill=weed_use)) +
  geom_col(color = "black") +
  coord_flip() +
  scale_fill_viridis_d(direction = -1) +
  theme_classic() +
  theme(legend.position = "none") +
  geom_text(aes(label = n), nudge_y = 3, size=4) +
  labs(title = "Weed Usage in 666posting",
       subtitle = paste("N =", sum(weed_count$n))) +
  labs(fill = "Weed Usage")+
  xlab("Weed Usage") +
  ylab("Count of Individuals")

ggsave("outputs/weed_usage_barplot.png")

Now let's look at that ztable

auto_ztable(dat_drugs,
          "alc_use",
          "weed_use",
          "Heatmap Table of Alcohol and Weed Usage",
          simul_p = T)
Heatmap Table of Alcohol and Weed Usage ; Chi-Square p-value of 0.0225
  I have never smoked weed I don't smoke weed anymore Once a month Once a week Multiple times per week Every day Multiple times a day
I never had alcohol 5 2 0 0 1 0 1
I don’t drink anymore 1 6 1 0 2 1 4
Once a month/socially 10 27 12 1 7 8 10
Once a week 2 16 9 4 5 3 3
Multiple times per week 1 25 12 2 11 2 5
Every day 0 6 0 0 1 1 1
Multiple times a day 0 0 1 0 0 0 1

So according to the chi-square, the association is nominally significant (p ≈ 0.02), but given how sparse the table is I wouldn't conclude much here. Overall it appears most people used to smoke weed but don't anymore, and most people seem to drink socially.

Now let us see if there are any interesting distributions with age

#Make a histogram of ages and habits
#dat_drugs$alc_use <- factor(dat_drugs$alc_use, names(sort(table(dat_drugs$alc_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=alc_use)) +
  geom_histogram(breaks=seq(15, 45, by=1), color="black") +
  scale_fill_viridis_d(option = "A", direction = -1) +
  theme_classic() +
  labs(title = "Alcohol Usage Histogram in 666posting",
       subtitle = paste("N =", sum(alc_count$n)),
       fill = "Alcohol Usage")

ggsave("outputs/alcohol_histogram.png")

Above is an interesting way to view the distribution, but to see differences there are better plots, such as the violin plot below:

ggplot(dat_drugs, aes(y=age, x= alc_use, fill=alc_use)) +
  geom_violin() +
  geom_sina(color = "grey40", alpha = .7) +
  stat_n_text(size = 3) +
  theme_classic() +
  scale_fill_viridis_d(option = "A", direction = -1) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Alcohol Usage in 666posting",
       subtitle = paste("N =", sum(alc_count$n)),
       fill = "Alcohol Usage")

ggsave("outputs/alcohol_boxplot.png")

These distributions don’t seem that different from one another. As a test, I’ll run a linear model to see if the drinking groups are a good predictor of age. That said, I’m not really going to check any assumptions (don’t do this); I will remove the one older individual because they are an outlier, but I don’t expect there to be a difference.
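For the record, those skipped assumption checks are cheap in base R. A sketch meant to run after the lm_alc fit below (the last line assumes the car package is installed):

```r
# Diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(lm_alc)

# Formal checks: residual normality and equal variance across groups
shapiro.test(residuals(lm_alc))
car::leveneTest(age ~ alc_use, data = no_out_drugs)
```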

no_out_drugs <- dat_drugs %>% filter(age <= 38) %>%
  mutate(center_age = age - mean(age))
no_out_drugs %>%
  group_by(alc_use)%>%
  get_summary_stats(age, type = "mean_sd")
## # A tibble: 7 x 5
##   alc_use                 variable     n  mean    sd
##   <fct>                   <chr>    <dbl> <dbl> <dbl>
## 1 I never had alcohol     age          9  21.8 5.12 
## 2 I don't drink anymore   age         15  23.7 4.17 
## 3 Once a month/socially   age         75  22.3 2.66 
## 4 Once a week             age         42  23.5 2.79 
## 5 Multiple times per week age         57  24.3 3.51 
## 6 Every day               age          9  24.7 1.94 
## 7 Multiple times a day    age          2  26.5 0.707
lm_alc <- lm(center_age ~ alc_use, data = no_out_drugs)
summary(lm_alc)
## 
## Call:
## lm(formula = center_age ~ alc_use, data = no_out_drugs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2667 -2.2667 -0.2667  2.3333 10.2667 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                     -1.5045     1.0528  -1.429   0.1545  
## alc_useI don't drink anymore     1.9556     1.3317   1.468   0.1435  
## alc_useOnce a month/socially     0.4889     1.1142   0.439   0.6613  
## alc_useOnce a week               1.6984     1.1602   1.464   0.1448  
## alc_useMultiple times per week   2.4854     1.1329   2.194   0.0294 *
## alc_useEvery day                 2.8889     1.4889   1.940   0.0537 .
## alc_useMultiple times a day      4.7222     2.4691   1.913   0.0572 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.159 on 202 degrees of freedom
## Multiple R-squared:  0.08829,    Adjusted R-squared:  0.06121 
## F-statistic:  3.26 on 6 and 202 DF,  p-value: 0.004415

Nothing much significant at the individual level (only the multiple-times-per-week group crosses 0.05), though the overall model is (F-test p ≈ 0.004), and social drinking trends towards younger individuals, which is interesting.

And now I’m going to reveal all the filthy, filthy lawbreakers in this group; I’m talking underage drinking. (I’m being sarcastic, but I think it’s an interesting question.)

#Alcohol based on country
dat_drugs$alc_use <- factor(dat_drugs$alc_use, names(sort(table(dat_drugs$alc_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=`Drinks in the USA`)) +
  geom_histogram(breaks=seq(15, 45, by=1), color="black") +
  geom_vline(xintercept = 21, linetype = "dashed", color = "red", size = 1.4) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Woah, that's illegal buddies!",
       subtitle = paste0("I'm reporting all ",
                         sum(dat_drugs$underage_drinking == "Yes"),
                         " of you to the police! /s")) +
  theme_classic()

ggsave("outputs/underage_drinking.png")

Disgusting…

Weed usage plots

Alright, on to the completely legal topic of weed usage.

#Weed
#dat_drugs$weed_use <- factor(dat_drugs$weed_use, names(sort(table(dat_drugs$weed_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=weed_use)) +
  geom_histogram(breaks=seq(15, 45, by=1), color="black") +
  scale_fill_viridis_d(direction = -1) +
  labs(title = "Weed Usage in 666posting",
       subtitle = paste("N =", sum(weed_count$n)),
       fill = "Weed Usage") +
  theme_classic()

ggsave("outputs/weed_usage_histogram.png")

Again, the better way to view the data:

ggplot(dat_drugs, aes(y=age, x= weed_use, fill=weed_use)) +
  geom_violin() +
  geom_sina(color = "grey40", alpha = .7) +
  stat_n_text(size = 3) +
  scale_fill_viridis_d(direction = -1) +
  coord_flip() +
  labs(title = "Weed Usage in 666posting",
        subtitle = paste("N =", sum(weed_count$n)),
       fill = "Weed Usage") +
  xlab("Weed Usage")  +
  theme_classic() +
  theme(legend.position = "none") 

ggsave("outputs/weed_usage_boxplot.png")

Let us test whether age differs significantly across weed-use groups.

no_out_drugs %>%
  group_by(weed_use)%>%
  get_summary_stats(age, type = "mean_sd")
## # A tibble: 7 x 5
##   weed_use                   variable     n  mean    sd
##   <fct>                      <chr>    <dbl> <dbl> <dbl>
## 1 I have never smoked weed   age         19  22    3.40
## 2 I don't smoke weed anymore age         82  23.4  3.46
## 3 Once a month               age         35  23.4  3.09
## 4 Once a week                age          7  25.1  2.91
## 5 Multiple times per week    age         26  22.4  2.37
## 6 Every day                  age         15  21.9  2.15
## 7 Multiple times a day       age         25  25.0  3.35
lm_weed <- lm(center_age ~ weed_use, data = no_out_drugs)
summary(lm_weed)
## 
## Call:
## lm(formula = center_age ~ weed_use, data = no_out_drugs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3537 -2.3537 -0.4231  1.9600  9.6463 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                        -1.28230    0.72731  -1.763  0.07940 . 
## weed_useI don't smoke weed anymore  1.35366    0.80719   1.677  0.09509 . 
## weed_useOnce a month                1.40000    0.90341   1.550  0.12278   
## weed_useOnce a week                 3.14286    1.40171   2.242  0.02604 * 
## weed_useMultiple times per week     0.42308    0.95684   0.442  0.65885   
## weed_useEvery day                  -0.06667    1.09500  -0.061  0.95151   
## weed_useMultiple times a day        3.04000    0.96489   3.151  0.00188 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.17 on 202 degrees of freedom
## Multiple R-squared:  0.08148,    Adjusted R-squared:  0.0542 
## F-statistic: 2.987 on 6 and 202 DF,  p-value: 0.008121

Interestingly, weed usage shows that older people in this group tend to consume weed more than their younger counterparts. I feel this is somewhat driven by outliers, but “once a week” usage, with no outliers, was also significant. Part of me wonders how much of this has to do with financial or living situations: perhaps older individuals smoke more because they have both the financial ability and the space (free from those who might look down on it) to do so. I’m not sure; it would’ve been pretty weird if I had asked “What is your yearly income?” and “Do you live with an authority figure?”.
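Since the worry is that a few extreme ages drive the group means, a rank-based test is a quick robustness check (a sketch, assuming the `dat_drugs` data frame from above; `kruskal.test` is in base R's stats package):

```r
#Kruskal-Wallis compares group distributions by rank, so it is far
#less sensitive to age outliers than the linear model above
kruskal.test(age ~ weed_use, data = dat_drugs)
```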

### Add world and US weed laws

map <- match(dat_drugs$country, world_weed$country)
dat_drugs$weed_laws <- world_weed$weed_laws[map]
map <- match(dat_drugs$state, state_weed$state)
ind <- dat_drugs$country %in% "United States"
dat_drugs$weed_laws[ind] <- (state_weed$weed_laws[map])[ind]
dat_drugs <- mutate(dat_drugs, weed_laws = case_when(
  state == "District of Columbia" ~ "recreational",
  state == "" ~ "medical/complicated",
  TRUE ~ weed_laws
))
dat_drugs$weed_laws <- factor(dat_drugs$weed_laws,
                              c("not legal",
                                "medical/complicated",
                                "medical",
                                "decriminalized/complicated",
                                "recreational"))
dat_drugs <- mutate(dat_drugs, `General Weed Use` = ifelse(
    weed_use == "I have never smoked weed" | weed_use == "I don't smoke weed anymore",
    "Doesn't actively smoke weed",
    "Actively smokes weed"
  ))

Table of usage compared to laws

drug_table <- table(dat_drugs$`weed_laws`, dat_drugs$`General Weed Use`)

z <- ztable(drug_table,
            caption = "Weed use compared to country/state laws",
            align = "ccccccc")

z$caption <-  "Weed use compared to country/state laws"
z$position <- "ccccccc"

z %>%
  addCellColor(rows = 2, cols = 2, "lava") %>%
  addCellColor(rows = 2:6, cols = 3, "indiagreen") %>%
  addCellColor(rows = 3:4, cols = 2, "orange") %>%
  addCellColor(rows = 5:6, cols = 2, "indiagreen")
Weed use compared to country/state laws

                              Actively smokes weed   Doesn't actively smoke weed
  not legal                                     22                            25
  medical/complicated                           34                            39
  medical                                       29                            20
  decriminalized/complicated                     1                             4
  recreational                                  23                            13
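Whether active use is associated with local legality can be tested directly on the contingency table (a sketch, assuming the `drug_table` object built above):

```r
#Pearson's chi-square test of independence between
#local weed laws and active weed use
chisq.test(drug_table)
```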

Concerts, General and Alex G Specific

In general

dat_habits$concerts <- factor(dat_habits$concerts, levels =  unique(dat_habits$concerts)[c(6,4,2,3,5,1)])


ggplot(dat_habits, aes(x=concerts, fill = concerts)) +
  geom_bar(color = "black") + coord_flip() +
  labs(title = "How often do you go to concerts?",
       subtitle = paste("N = ", nrow(dat_habits)))  +
  scale_fill_brewer(palette = "Dark2") +
  theme_classic() +
  geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
  theme(legend.position = "none") 

Alex G concerts

## Preparing the data

alex_g_conc_dat <- dat %>% select(Timestamp, alex_g_concerts, state, living_area, concerts) %>%
  filter(!is.na(alex_g_concerts)) %>%
  mutate(`Lives Near Philly?` = ifelse(state %in% c("Pennsylvania",
                                                     "New Jersey",
                                                    "Delaware"), "Yes", "No"))
alex_g_conc_dat$concerts <- factor(alex_g_conc_dat$concerts, levels =  unique(alex_g_conc_dat$concerts)[c(6,4,2,3,5,1)])

Making a histogram showing the general results. As you can see there are some outliers, but I have theories on why these outliers exist.

ggplot(alex_g_conc_dat, aes(x = alex_g_concerts)) +
           geom_bar(color="black")  +
  theme_classic() +
  xlab("How many Alex G concerts have you seen?")

My theory is that they live near Philly, and using the state data I collected I can determine that two of the largest outliers do indeed live near Philly (defined as living in Pennsylvania, New Jersey, or Delaware). That said, the individual who has seen Alex 40 times lives in Missouri.

Fun fact: according to setlist.fm, Alex has played 284 concerts.

ggplot(alex_g_conc_dat, aes(x = alex_g_concerts, fill = `Lives Near Philly?`)) +
  geom_bar(color="black") +
  scale_fill_brewer(palette = "Dark2") +
  annotate(geom="text", x=c(37, 17), y=5, label=c("Lives in Missouri", "Lives in South Carolina"),
              color="Black") +
  labs(
    title = "Does seeing Alex G concerts correlate with
living near Philly?",
 subtitle = paste0("N = ", nrow(alex_g_conc_dat))
  ) +
  xlab("How many Alex G concerts have you seen?") +
  ylab("Count") +
  theme_classic()

ggsave(filename = "outputs/alex_g_conc_philly.png", width = hght, height = wdth)

But another theory I had is that living near a city in general might increase the probability of an individual going to more concerts.

alex_g_conc_dat$living_area <- factor(alex_g_conc_dat$living_area,
                                      names(sort(table(alex_g_conc_dat$living_area),
                                                 decreasing= F)))
ggplot(alex_g_conc_dat, aes(x = alex_g_concerts, fill = living_area)) +
           geom_bar(color="black") +
  labs(fill = "Area of Living") +
  xlab("How many Alex G concerts have you seen?") +
  theme_classic()

Violin plot of living area and going to concerts

ggplot(alex_g_conc_dat, aes(y = alex_g_concerts, x = living_area, fill = living_area)) +
   geom_violin() + 
  geom_sina() +
  coord_flip() +
  stat_n_text(size = 3) +
  theme_classic()  +
  xlab("Living area") +
  ylab("Number of Alex G Concerts") +
  theme(legend.position = "none") 

no_out_conc <- alex_g_conc_dat %>% filter(alex_g_concerts < 10)


ggplot(no_out_conc, aes(y = alex_g_concerts, x = living_area, fill = living_area)) +
   geom_violin() +
  geom_sina() +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  stat_n_text(size = 3)  +
  theme_classic() +
  labs(fill = "Area of Living") +
  theme(legend.position = "none") +
  labs(
    title = "Does seeing Alex G concerts depend on
how close you are to a city?",
    subtitle = paste0("N = ", nrow(no_out_conc))
  ) +
  xlab("Living area") +
  ylab("Number of Alex G Concerts")

ggsave(filename = "outputs/alex_g_where_live.png", width = hght, height = wdth)

Violin plot of concert frequency and number of Alex G concerts

ggplot(alex_g_conc_dat, aes(y = alex_g_concerts, x = concerts, fill = concerts)) +
   geom_violin() +
  geom_sina() + 
  coord_flip() +
  stat_n_text(size = 3)+
  theme_classic()  +
  theme(legend.position = "none") 

ggplot(no_out_conc, aes(y = alex_g_concerts, x = concerts, fill = concerts)) +
  geom_violin() +
  geom_sina() +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none") +
  stat_n_text(size = 3) +
  labs(
    title = "How many Alex G concerts have you seen
depending on your concert habits",
    subtitle = paste0("N = ", nrow(no_out_conc))
  ) +
  xlab("Concert Habits") +
  ylab("Number of Alex G Concerts") +
  theme_classic()  +
  theme(legend.position = "none") 

ggsave(filename = "outputs/alex_g_conc_hab.png", width = hght, height = wdth)

That “once a year if ever” dude with 40 Alex G concerts must be 40 then… something tells me that might’ve been a mistake.

I decided to use some linear models to check if any of these variables were good predictors.

conc_lm <- lm(formula = alex_g_concerts ~ living_area + `Lives Near Philly?` + concerts,
                data = alex_g_conc_dat)

sum_conc_lm <- summary(conc_lm)
sum_conc_lm$coefficients
##                                               Estimate Std. Error       t value
## (Intercept)                               1.744968e-13   3.740517  4.665046e-14
## living_areaIn the outskirts of a city    -1.999617e-01   1.313950 -1.521837e-01
## living_areaIn the suburbs                 1.120113e-01   1.256890  8.911790e-02
## living_areaIn a city                      1.381653e+00   1.198747  1.152581e+00
## `Lives Near Philly?`Yes                   3.707999e+00   1.062184  3.490920e+00
## concertsOnce a year if ever               2.093536e+00   3.979742  5.260482e-01
## concertsEvery 3-4 months                  9.088468e-01   3.916566  2.320520e-01
## concertsEvery month or every other month  1.736156e+00   3.951054  4.394159e-01
## concertsEvery two weeks                   6.820543e-02   4.059121  1.680301e-02
## concertsWhenever I can                    1.572409e+00   3.932301  3.998699e-01
##                                              Pr(>|t|)
## (Intercept)                              1.0000000000
## living_areaIn the outskirts of a city    0.8791954668
## living_areaIn the suburbs                0.9290773914
## living_areaIn a city                     0.2504583279
## `Lives Near Philly?`Yes                  0.0005922191
## concertsOnce a year if ever              0.5994375140
## concertsEvery 3-4 months                 0.8167349029
## concertsEvery month or every other month 0.6608346103
## concertsEvery two weeks                  0.9866105225
## concertsWhenever I can                   0.6896790533

It seems that the best predictor in this case is living near Philly! But that’s not an honest analysis, because this result is likely driven by outliers, so we’re going to toss them.

## Without the outliers


conc_lm <- lm(formula = alex_g_concerts ~ living_area + `Lives Near Philly?` + concerts,
                data = no_out_conc)

sum_conc_lm <- summary(conc_lm)
sum_conc_lm$coefficients
##                                               Estimate Std. Error       t value
## (Intercept)                              -1.407823e-15  1.7205786 -8.182264e-16
## living_areaIn the outskirts of a city     3.009316e-01  0.6048521  4.975292e-01
## living_areaIn the suburbs                 3.222993e-01  0.5786990  5.569377e-01
## living_areaIn a city                      8.150397e-01  0.5520281  1.476446e+00
## `Lives Near Philly?`Yes                   1.220924e+00  0.5274623  2.314714e+00
## concertsOnce a year if ever               5.119795e-01  1.8319731  2.794689e-01
## concertsEvery 3-4 months                  1.265932e+00  1.8016165  7.026646e-01
## concertsEvery month or every other month  1.698775e+00  1.8178097  9.345178e-01
## concertsEvery two weeks                   8.652422e-01  1.8676406  4.632809e-01
## concertsWhenever I can                    1.397884e+00  1.8090075  7.727353e-01
##                                            Pr(>|t|)
## (Intercept)                              1.00000000
## living_areaIn the outskirts of a city    0.61937358
## living_areaIn the suburbs                0.57820542
## living_areaIn a city                     0.14142904
## `Lives Near Philly?`Yes                  0.02166441
## concertsOnce a year if ever              0.78017970
## concertsEvery 3-4 months                 0.48309847
## concertsEvery month or every other month 0.35118676
## concertsEvery two weeks                  0.64367706
## concertsWhenever I can                   0.44060990

So it still appears that the biggest determinant of seeing Alex G concerts is whether you live near Philly, which makes sense: before Alex was touring, most of his shows were in that area (and people could’ve included Skin Cells concerts as Alex G concerts). That said, once you remove the outliers from the dataset, living near Philly is no longer as strong a predictor of how many Alex G concerts you’ve seen, though it is still the biggest one.
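One caveat worth noting: the number of concerts seen is a non-negative count, so a Poisson GLM may fit better than ordinary least squares. A minimal sketch with the same predictors, assuming the `no_out_conc` data frame from above:

```r
#Poisson regression with a log link; coefficients are then
#interpretable as multiplicative effects on the expected count
glm_conc <- glm(alex_g_concerts ~ living_area + `Lives Near Philly?` + concerts,
                family = poisson, data = no_out_conc)
summary(glm_conc)
```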

ggplot(no_out_conc, aes(y = alex_g_concerts, x = `Lives Near Philly?`, fill = `Lives Near Philly?`)) +
   geom_violin() + 
  geom_jitter(width = 0.25, height = 0.2) +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") + 
  theme_classic() +
  theme(legend.position = "none") +
  stat_n_text(size = 3) +
  labs(
    title = "How many Alex G concerts have you seen
based on living near Philly",
    subtitle = paste0("N = ", nrow(no_out_conc))
  ) +
  xlab("Live near Philly?") +
  ylab("Number of Alex G Concerts")

Book and Movie Genres

I didn’t know where this section fit best, so I just put it here. I didn’t really have any hypotheses here either, but I hope you enjoy the data!

bovie_dat <- dat %>% select(age, movie_genre, book_genre) %>%
  filter(movie_genre != "", book_genre != "")

movie_plot <- ggplot(bovie_dat, aes(movie_genre)) +
      geom_bar() +
     coord_flip() +
      xlab("Movie Genres") +
      theme_classic() +
      geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
      labs(title = "Favorite Movie Genres",
           subtitle = paste0("N = ",nrow(bovie_dat)))

movie_plot

book_plot <- ggplot(bovie_dat, aes(book_genre)) +
        geom_bar() +
        coord_flip() +
        xlab("Book Genres") +
        theme_classic()  +
        geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
        labs(title = "Favorite Book Genres",
             subtitle = paste0("N = ",nrow(bovie_dat)))

book_plot

auto_ztable(bovie_dat,
          "movie_genre",
          "book_genre",
          "Heatmap Table of book and movie genres",
          simul_p = T)
Heatmap table of book and movie genres; chi-square p-value of 0.316
(rows: favorite movie genre; columns: favorite book genre)

                      Crime  Drama  Fantasy  Historical Fiction  Horror  I don't read books  Mystery  Nonfiction  Poetry  Romance  Satire  Sci-Fi  Suspense/Thriller
Action                    0      1        0                   0       0                   0        0           1       0        0       1       1                  0
Adventure                 0      0        1                   1       0                   0        0           2       0        0       0       1                  0
Art House                 0      3        4                   3       1                   1        0          13       5        0       3       4                  1
Comedy                    0      4        2                   1       1                   1        0           8       2        0       2       2                  1
Documentary               0      0        0                   0       0                   0        1           2       3        0       2       1                  0
Drama                     1     14       12                   3       0                   6        1           8       4        1       1       4                  0
Horror                    1      4        1                   0       2                   4        0           7       2        2       2       4                  2
I don't watch movies      0      0        0                   2       0                   1        0           0       1        0       0       1                  0
Romantic comedy           0      2        2                   0       0                   0        0           3       3        1       1       0                  0
Suspense/Thriller         1      1        3                   0       1                   2        3           5       2        0       1       4                  2
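With this many sparse cells, the asymptotic chi-square approximation is unreliable, which is presumably why a simulated p-value was requested (`simul_p = T`). In base R that corresponds to something like the following (a sketch, assuming the `bovie_dat` data frame from above):

```r
#Monte Carlo p-value: permute the table B times instead of relying
#on the asymptotic chi-square distribution
chisq.test(table(bovie_dat$movie_genre, bovie_dat$book_genre),
           simulate.p.value = TRUE, B = 10000)
```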

College/Major Analysis

Another question we asked was what type of college major people had, or whether they even decided to attend college, which we look at below.

college_dat <- dat %>% mutate(id = row_number()) %>%
                    select(id, col_major) 

college_dat_melt <- college_dat %>%
  mutate(col_major = (str_split(col_major, ","))) %>%
  unnest(cols = col_major) %>%
  mutate(col_major =(str_trim(col_major, "both"))) %>%
  filter(!(col_major %in% c("Used to","but not anymore", "", "Nope")))


#General stats
sum_major <- as.data.frame(table(college_dat_melt$col_major),
                           stringsAsFactors = F) %>%
              mutate(Var1 = str_to_title(Var1))

#Check data looks okay
#unique(college_dat_melt$col_major)

#Make a dataframe giving a 1 or 0 to each person based on whether they studied the major or not, 1 is TRUE and 0 is FALSE
college_df <- (as.data.frame.matrix(table(college_dat_melt$id, college_dat_melt$col_major)))
#Make major strings look like titles
colnames(college_df) <- str_to_title(colnames(college_df))

degree_list <- list()
#Probably should change this to an lapply, but we use this loop to generate a list of ids that have a given major
for (degree in 1:ncol(college_df)) {
   degree_list[[colnames(college_df)[degree]]] <- (1:nrow(college_df))[college_df[,degree] == 1]
}
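As the comment above notes, the loop can be collapsed into a single apply call; a sketch that should produce the same named list:

```r
#lapply iterates over the columns of the data frame, and which()
#returns the row ids (people) flagged 1 for that major
degree_list <- lapply(college_df, function(col) which(col == 1))
```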

Upset plot for different majors to see what double majors we have

plot_degree <- upset(fromList(degree_list),
                 order.by = "freq",
                 nsets =20, 
                 #point.size = 4,
                 #line.size = 1.8,
                 text.scale=c(2.8, 1.6, 1.8, 1.6, 1.3, 2),
                 mainbar.y.label = "Major Intersections",
                 sets.x.label = "People Per Major"
                 )

plot_degree

two_degrees <- sum(rowSums(college_df) == 2)
three_degrees <- sum(rowSums(college_df) == 3)
four_degrees <- sum(rowSums(college_df) == 4)

We have 23 people with two majors, 4 people with three majors, and 1 person with four.

The most average Alex G person

library(lubridate)
ranked_top <- function(x, rank = 1, to_low = T){
  temp <- names(sort(table(x), decreasing = TRUE)[rank])
  if (to_low == T) {
    temp <- tolower(temp)
  }
  return(temp)
}
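To make the helper concrete, a toy example on hypothetical data (not from the survey):

```r
#table() counts each value, sort() ranks the counts,
#and we take the rank-th most common entry
x <- c("cat", "dog", "dog", "bird", "dog", "cat")
ranked_top(x)            # "dog" (most common, lowercased)
ranked_top(x, rank = 2)  # "cat"
```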

#When the average person filled out the survey
#mean(mdy_hm(dat$Timestamp))
#Average age
avg_age <- median(dat$age)
#Country
avg_country <- ranked_top(dat$country,to_low = F)
#State
avg_state <- ranked_top(dat$state, to_low = F, rank = 3)
#Race
avg_race <- ranked_top(dat$race)
#They are male
avg_gender <- ranked_top(dat$gender)
#They are cis
avg_identity <- ranked_top(dat$transgender)
#They are straight
avg_orientation <- ranked_top(dat$orientation)
#They are monogomous
avg_poly <- ranked_top(dat$polyamorous)
#They started listening to alex G at the age of 19
avg_start_listen <- median(dat$first_age)
#They've been to 2 Alex G concerts
avg_alex_conc <- median(dat$alex_g_concerts, na.rm = T)
# They found Alex G through a friend
avg_discovery <- ranked_top(dat$discover_method)
#Their favorite song is Snot
avg_fave_song <- ranked_top(dat$favorite_song, to_low = FALSE)
#Their favorite album is Trick
avg_fave_album <- ranked_top(dat$favorite_album, to_low = FALSE)
#They don't listen to Alex G's unreleased albums
avg_unreleased <- ranked_top(dat$unreleased_album)
#They like the gretel music video
avg_music_vid <- ranked_top(dat$music_vid, to_low = FALSE)
#They listen to music via spotify
avg_music_source <- ranked_top(dat$music_method)
#They probably play guitar
avg_instrument <- (sum_instr %>% arrange(desc(Freq)))[1,1]
#They are not in a band
avg_band <- ranked_top(dat$in_band)
#Favorite musician
avg_musician <- ranked_top(musician_dat$favorite_musician, to_low = FALSE)
#Second fave musician
second_musician <- ranked_top(musician_dat$sec_musician, rank = 1, to_low = FALSE)
#Non Alex musician
second_2_musician <- ranked_top(musician_dat$sec_musician, rank = 2, to_low = FALSE)
#Art hobby
avg_art <- (sum_art %>% arrange(desc(Freq)))[1,1]
#Major
avg_major <- (sum_major %>% arrange(desc(Freq)))[1,1]
##How often they go to concerts
avg_concert_go <- ranked_top(dat$concerts)
# Weed
avg_weed <- ranked_top(dat$weed_use)
#Alcohol 
avg_alcohol <- ranked_top(dat$alc_use)
#Living area
avg_living <- ranked_top(dat$living_area)
#authoritarian_lib
med_lib_auth <- median(dat$lib_auth, na.rm = T)
#Left right
med_left_right <- median(dat$left_right, na.rm = T)
#MBTI
med_meyers <- ranked_top(dat$mbti, to_low = F)
#Harry potter
med_harry <- ranked_top(dat$hp_house)
#Astro-sign
med_sign <- ranked_top(dat$sign, to_low = F)

So around 2020-05-08 09:57:30 a white heterosexual/straight male who is 23 years old decided to fill out the survey. They live in the United States, specifically in California, and when asked if they prefer multiple romantic partners they generally say no. Their favorite musician is Alex G; when you ask them their second favorite musician they’ll emphasize how much they like Alex G before balking and saying Modest Mouse. They have a lot of opinions about Alex G: their favorite song is Snot, their favorite music video is Gretel, and their favorite album is Trick. That said, when asked what their favorite fan compilation is they reply, “i don’t listen to them.”, which is probably because they listen to most of their music through Spotify, where it is somewhat difficult to find non-official tracks. Their favorite song is definitely Snot.

Nevertheless, they’re still a big Alex G fan: they started listening to him when they were 19 years old and have been to 2 concerts so far. When asked what they like to do in their spare time, they have a couple hobbies. They like to play the guitar, but when asked if they’re in a band, they’d say no. They also use writing to express themselves, and when it comes to concerts they go “whenever i can”. They are studying/have studied in a STEM field and live in a city. When it comes to weed they would say “i used to smoke weed, but now i don’t”, and with alcohol they’d say they drink “once a month/socially”.

When it comes to their political leanings, on scales of 1 to 10 from left to right politics and from libertarian to authoritarian, they fall around 2 and 3 respectively. As for their personality, if you arbitrarily asked them their MBTI, astrological sign, and Harry Potter house, they would tell you, respectively, INFP, Gemini, and “i don’t know”. They find all of these questions you’re asking them kinda weird, almost like this format I decided to use to explain the most average Alex G fan didn’t pan out the way I wanted, but they know everyone is just trying their best.

Special Thanks!

Modmin Team of 666posting

In the order Facebook decided:

  • Ignacy R.
  • Isabelle J.
  • Abby C.
  • Armando M.
  • Cody G.
  • Ashtyn K.
  • Marley M.
  • Kara B.

Special shoutout to Isabelle for allowing and promoting the survey.

Other thanks in order of commenting on the original post

  • Zion H.: Inspiring the initial graph and indirectly the survey
  • Miguel Z.: Making the first preliminary visualizations
  • Alex J.: Favorite movie and book genres questions
  • Rivers F.: Consistent support throughout the process
  • Andrew W.: Consistent support as well
  • Fabiana O.: Suggested questions on race and queerness
  • Hillary N.: Suggested astrological signs questions
  • Luke D.: Suggested political stance questions
  • Gwendylan T.: Suggested weed usage and Myers-Briggs type questions
  • Abigail C.: For testing the survey and giving constructive criticism
  • Luke P.: Allowing me to link his video

Helpful friends not of the group

  • Stephanie Y.: Gave critiques to the Rmarkdown and suggested spotifyr
  • Veronica B.: Suggested an appropriate statistical test for regional differences, and gave helpful criticisms
  • Kurt W.: Helpful criticisms as well.

One special thanks

Graham aka Grumpus for talking to me about Alex G back when I knew diddly squat besides Mary. Thank you for introducing me to one of my favorite artists of all time.

Things to Do

  • Shiny